Preface


Chair of Econometrics Department of Business Administration and Economics 
University of Duisburg-Essen Essen, Germany info@econometrics-with-r.org 
Last updated on Tuesday, September 15, 2020 


Over recent years, the statistical programming language R has become an
integral part of the curricula of the econometrics classes we teach at the
University of Duisburg-Essen. We regularly found that a large share of the
students, especially in our introductory undergraduate econometrics courses,
had not been exposed to any programming language before and thus had
difficulty learning R on their own. With little background in statistics and
econometrics, it is natural for beginners to have a hard time understanding the
benefits of R skills for learning and applying econometrics. These benefits
include, in particular, the ability to conduct, document and communicate
empirical studies and the means to program simulation studies, which are
helpful for, e.g., comprehending and validating theorems that are not easily
grasped by merely brooding over formulas. As applied economists and
econometricians, we value all of these capabilities and wish to share them with
our students.


Instead of confronting students with pure coding exercises and complementary
classic literature like the book by Venables and Smith (2010), we figured it would
be better to provide interactive learning material that blends R code with the
contents of the well-received textbook Introduction to Econometrics by Stock
and Watson (2015), which serves as a basis for the lecture. This material is
gathered in the present book Introduction to Econometrics with R, an empirical
companion to Stock and Watson (2015). It is an interactive script in the style
of a reproducible research report. It enables students not only to learn how the
results of case studies can be replicated with R but also to strengthen their
ability to use the newly acquired skills in other empirical applications.


Conventions Used in this Book 


• Italic text indicates new terms, names, buttons and the like.

• Constant width text is generally used in paragraphs to refer to R code.
This includes commands, variables, functions, data types, databases and
file names.

• Constant width text on gray background indicates R code that can be
typed literally by you. It may appear in paragraphs to distinguish
executable from non-executable code statements, but it will mostly be
encountered in the shape of large blocks of R code. These blocks are
referred to as code chunks.

Chapter 1


Introduction 


The interest in the freely available statistical programming language and soft- 
ware environment R (R Core Team, 2020) is soaring. By the time we wrote 
first drafts for this project, more than 11000 add-ons (many of them providing 
cutting-edge methods) were made available on the Comprehensive R Archive 
Network (CRAN), an extensive network of FTP servers around the world that 
store identical and up-to-date versions of R code and its documentation. R dom- 
inates other (commercial) software for statistical computing in most fields of 
research in applied statistics. The benefits of it being freely available, open 
source and having a large and constantly growing community of users that con- 
tribute to CRAN render R more and more appealing for empirical economists 
and econometricians alike. 


A striking advantage of using R in econometrics is that it enables students to
document their analysis explicitly, step by step, so that it is easy to update
and to expand. This makes it possible to reuse code for similar applications
with different data. Furthermore, R programs are fully reproducible, which makes
it straightforward for others to comprehend and validate results.


Over the recent years, R has thus become an integral part of the curricula of 
econometrics classes we teach at the University of Duisburg-Essen. In some 
sense, learning to code is comparable to learning a foreign language and contin- 
uous practice is essential for the learning success. Needless to say, presenting 
bare R code on slides does not encourage the students to engage with hands-on 
experience on their own. This is why R is crucial. As for accompanying liter- 
ature, there are some excellent books that deal with R and its applications to 
econometrics, e.g., Kleiber and Zeileis (2008). However, such sources may be 
somewhat beyond the scope of undergraduate students in economics having little 
understanding of econometric methods and barely any experience in program- 
ming at all. Consequently, we started to compile a collection of reproducible 
reports for use in class. These reports provide guidance on how to implement 
selected applications from the textbook Introduction to Econometrics (Stock
and Watson, 2015) which serves as a basis for the lecture and the accompanying
tutorials. This process was facilitated considerably by knitr (Xie, 2020b)
and R markdown (Allaire et al., 2020). In conjunction, both R packages provide
powerful functionalities for dynamic report generation which make it possible
to seamlessly combine pure text, LaTeX, R code and its output in a variety of
formats, including PDF and HTML. Moreover, writing and distributing
reproducible reports for use in academia has been enriched tremendously by the
bookdown package (Xie, 2020a) which has become our main tool for this project.
bookdown builds on top of R markdown and makes it possible to create appealing
HTML pages like this one, among other things. Inspired by Using R for
Introductory Econometrics (Heiss, 2016)¹ and with this powerful toolkit at
hand, we wrote up our own empirical companion to Stock and Watson (2015). The
result, which you have started to look at, is Introduction to Econometrics
with R.


Similarly to the book by Heiss (2016), this project is neither a comprehensive 
econometrics textbook nor is it intended to be a general introduction to R. We 
feel that Stock and Watson do a great job at explaining the intuition and theory 
of econometrics, and at any rate better than we could in yet another introduc- 
tory textbook! Introduction to Econometrics with R is best described as an 
interactive script in the style of a reproducible research report which aims to 
provide students with a platform-independent e-learning arrangement by seam- 
lessly intertwining theoretical core knowledge and empirical skills in undergrad- 
uate econometrics. Of course, the focus is on empirical applications with R. We 
leave out derivations and proofs wherever we can. Our goal is to enable students
not only to learn how results of case studies can be replicated with R but also
to strengthen their ability to use the newly acquired skills in other empirical
applications, immediately within Introduction to Econometrics with R.


To realize this, each chapter contains interactive R programming exercises. 
These exercises are used as supplements to code chunks that display how pre- 
viously discussed techniques can be implemented within R. They are generated 
using the DataCamp light widget and are backed by an R session which is main- 
tained on DataCamp’s servers. You may play around with the example exercise 
presented below. 


As you can see above, the widget consists of two tabs. script.R mimics an .R- 
file, a file format that is commonly used for storing R code. Lines starting with 
a # are commented out, that is, they are not recognized as code. Furthermore, 
script.R works like an exercise sheet where you may write down the solution 
you come up with. If you hit the button Run, the code will be executed, submis- 
sion correctness tests are run and you will be notified whether your approach is 
correct. If it is not correct, you will receive feedback suggesting improvements 
or hints. The other tab, R Console, is a fully functional R console that can be 
used for trying out solutions to exercises before submitting them. Of course,
you may submit (almost any) R code and use the console to play around and
explore. Simply type a command and hit the Enter key on your keyboard.

¹ Heiss (2016) builds on the popular Introductory Econometrics (Wooldridge, 2016) and
demonstrates how to replicate the applications discussed therein using R.


Looking at the widget above, you will notice that there is a > in the right panel 
(in the console). This symbol is called “prompt” and indicates that the user 
can enter code that will be executed. To avoid confusion, we will not show this 
symbol in this book. Output produced by R code is commented out with #>. 


Most commonly we display R code together with the generated output in code
chunks. As an example, consider the line of code presented in the chunk
below. It tells R to compute the number of packages available on CRAN. The
code chunk is followed by the output produced.


# check the number of R packages available on CRAN 
nrow(available.packages(repos = "http://cran.us.r-project.org")) 
#> [1] 16272 


Each code chunk is equipped with a button on the outer right hand side which
copies the code to your clipboard. This makes it convenient to work with larger
code segments in your version of R/RStudio or in the widgets presented
throughout the book. In the widget above, you may click on R Console and type
nrow(available.packages(repos = "http://cran.us.r-project.org"))
(the command from the code chunk above) and execute it by hitting Enter on
your keyboard.²


Note that some lines in the widget are commented out. They ask you to assign
a numeric value to a variable and then to print the variable’s content to the
console. You may enter your solution approach in script.R and hit the button
Run in order to get the feedback described further above. In case you do not
know how to solve this sample exercise (don’t panic, that is probably why you
are reading this), a click on Hint will provide you with some advice. If you
still can’t find a solution, a click on Solution will provide you with another tab,
Solution.R, which contains sample solution code. It will often be the case that
exercises can be solved in many different ways and Solution.R presents what
we consider comprehensible and idiomatic.


² The R session is initialized by clicking into the widget. This might take a few seconds.
Just wait for the indicator next to the button Run to turn green.


1.1 Colophon
This book was built with:


#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os macOS Catalina 10.15.4
#> system x86_64, darwin19.5.0
#> ui unknown
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Berlin
#> date 2020-09-15
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> abind 1.4-5 2016-07-21 [1] CRAN (R 4.0.2)
#> AER 1.2-9 2020-02-06 [1] CRAN (R 4.0.0)
#> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.0)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.1.8 2020-06-17 [1] CRAN (R 4.0.0)
#> base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.0.0)
#> bdsmatrix 1.3-4 2020-01-13 [1] CRAN (R 4.0.0)
#> BH 1.72.0-3 2020-01-08 [1] CRAN (R 4.0.2)
#> bibtex 0.4.2.2 2020-01-02 [1] CRAN (R 4.0.0)
#> bitops 1.0-6 2013-08-17 [1] CRAN (R 4.0.0)
#> blob 1.2.1 2020-01-20 [1] CRAN (R 4.0.0)
#> bookdown 0.20 2020-06-23 [1] CRAN (R 4.0.0)
#> boot 1.3-25 2020-04-26 [2] CRAN (R 4.0.2)
#> broom 0.7.0 2020-07-09 [1] CRAN (R 4.0.2)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0)
#> car 3.0-8 2020-05-21 [1] CRAN (R 4.0.0)
#> carData 3.0-4 2020-05-22 [1] CRAN (R 4.0.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0)
#> clipr 0.7.0 2019-07-23 [1] CRAN (R 4.0.0)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.0)
#> conquer 1.0.1 2020-05-06 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
#> cubature 2.0.4.1 2020-07-06 [1] CRAN (R 4.0.2)
#> curl 4.3 2019-12-02 [1] CRAN (R 4.0.0)
#> data.table 1.12.8 2019-12-09 [1] CRAN (R 4.0.0)
#> DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.0)
#> dbplyr 1.4.4 2020-05-27 [1] CRAN (R 4.0.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
#> dplyr 1.0.0 2020-05-29 [1] CRAN (R 4.0.0)
#> dynlm 0.3-6 2019-01-06 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
#> farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.0)
#> fastICA 1.2-2 2019-07-08 [1] CRAN (R 4.0.2)
#> fBasics 3042.89.14 2020-03-07 [1] CRAN (R 4.0.2)
#> fGarch 3042.83.2 2020-03-07 [1] CRAN (R 4.0.2)
#> forcats 0.5.0 2020-03-01 [1] CRAN (R 4.0.0)
#> foreign 0.8-80 2020-05-24 [2] CRAN (R 4.0.2)
#> Formula 1.2-3 2018-05-03 [1] CRAN (R 4.0.0)
#> fs 1.4.2 2020-06-30 [1] CRAN (R 4.0.2)
#> gbRd 0.4-11 2012-10-01 [1] CRAN (R 4.0.0)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.0)
#> ggplot2 3.3.2 2020-06-19 [1] CRAN (R 4.0.0)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 4.0.0)
#> gss 2.2-2 2020-05-26 [1] CRAN (R 4.0.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
#> haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.0)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.0)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
#> isoband 0.2.2 2020-06-20 [1] CRAN (R 4.0.0)
#> itewrpkg 0.0.0.9000 2020-07-28 [1] Github (mca91/itewrpkg@bf5448c)
#> jsonlite 1.7.0 2020-06-25 [1] CRAN (R 4.0.0)
#> KernSmooth 2.23-17 2020-04-26 [2] CRAN (R 4.0.2)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.0)
#> labeling 0.3 2014-08-23 [1] CRAN (R 4.0.0)
#> lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
#> lme4 1.1-23 2020-04-07 [1] CRAN (R 4.0.0)
#> lmtest 0.9-37 2019-04-30 [1] CRAN (R 4.0.2)
#> locpol 0.7-0 2018-05-24 [1] CRAN (R 4.0.0)
#> lubridate 1.7.9 2020-06-08 [1] CRAN (R 4.0.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
#> maptools 1.0-1 2020-05-14 [1] CRAN (R 4.0.0)
#> markdown 1.1 2019-08-07 [1] CRAN (R 4.0.0)
#> MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
#> Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
#> MatrixModels 0.4-1 2015-08-22 [1] CRAN (R 4.0.0)
#> matrixStats 0.56.0 2020-03-13 [1] CRAN (R 4.0.2)
#> maxLik 1.3-8 2020-01-10 [1] CRAN (R 4.0.0)
#> mgcv 1.8-31 2019-11-09 [2] CRAN (R 4.0.2)
#> mime 0.9 2020-02-04 [1] CRAN (R 4.0.0)
#> minqa 1.2.4 2014-10-09 [1] CRAN (R 4.0.0)
#> miscTools 0.6-26 2019-12-08 [1] CRAN (R 4.0.0)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
#> mvtnorm 1.1-1 2020-06-09 [1] CRAN (R 4.0.2)
#> SparseM 1.78 2019-12-13 [1] CRAN (R 4.0.2)
#> spatial 7.3-12 2020-04-26 [2] CRAN (R 4.0.2) 
#> stabledist 0.7-1 2016-09-12 [1] CRAN (R 4.0.2) 
#> stargazer 5.2.2 2018-05-30 [1] CRAN (R 4.0.2) 
#> statmod 1.4.34 2020-02-17 [1] CRAN (R 4.0.0) 
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0) 
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0) 
#> strucchange 1.5-2 2019-10-12 [1] CRAN (R 4.0.2) 
#> survival 3.2-3 2020-06-13 [2] CRAN (R 4.0.2) 
#> sys 3.4 2020-07-23 [1] CRAN (R 4.0.2) 
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0) 
#> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2) 
#> tidyr 1.1.0 2020-05-20 [1] CRAN (R 4.0.0) 
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0) 
#> tidyverse 1.3.0 2019-11-21 [1] CRAN (R 4.0.0) 
#> timeDate 3043.102 2018-02-21 [1] CRAN (R 4.0.0) 
#> timeSeries 3062.100 2020-01-24 [1] CRAN (R 4.0.2) 
#> tinytex 0.25 2020-07-24 [1] CRAN (R 4.0.2) 
#> TTR 0.23-6 2019-12-15 [1] CRAN (R 4.0.2) 
#> urca 1.3-0 2016-09-06 [1] CRAN (R 4.0.2) 
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.0) 
#> vars 1.5-3 2018-08-06 [1] CRAN (R 4.0.2) 
#> vctrs 0.3.2 2020-07-15 [1] CRAN (R 4.0.2) 
#> viridisLite 0.3.0 2018-02-01 [1] CRAN (R 4.0.0) 
#> whisker 0.4 2019-08-28 [1] CRAN (R 4.0.0) 
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0) 
#> xfun 0.16 2020-07-24 [1] CRAN (R 4.0.2) 
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.0) 
#> xts 0.12-0 2020-01-19 [1] CRAN (R 4.0.0) 
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0) 
#> zip 2.0.4 2019-09-01 [1] CRAN (R 4.0.0) 
#> zoo 1.8-8 2020-05-02 [1] CRAN (R 4.0.0) 
#> 


#> [1] /usr/local/lib/R/4.0/site-library 
#> [2] /usr/local/Cellar/r/4.0.2_1/lib/R/library 


1.2 A Very Short Introduction to R and RStudio 


R Basics 


As mentioned before, this book is not intended to be an introduction to R but
a guide on how to use its capabilities for applications commonly encountered in
undergraduate econometrics. Those having basic knowledge in R programming
will feel comfortable starting with Chapter 2. This section, however, is meant
for those who have not worked with R or RStudio before. If you at least know
how to create objects and call functions, you can skip it. If you would like to
refresh your skills or get a feeling for how to work with RStudio, keep reading.


Figure 1.1: RStudio: the four panes


First of all, start RStudio and open a new R script by selecting File, New File, 
R Script. In the editor pane, type 


1 + 1


and click on the button labeled Run in the top right corner of the editor. By 
doing so, your line of code is sent to the console and the result of this operation 
should be displayed right underneath it. As you can see, R works just like a 
calculator. You can do all arithmetic calculations by using the corresponding 
operator (+, -, *, / or ^). If you are not sure what the last operator does, try 
it out and check the results. 
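As a quick illustration of the operators just listed, the following lines can be
run one at a time. The numbers are arbitrary choices for this sketch and do not
come from the book.

```r
# basic arithmetic in R (each result is stored and then printed)
addition       <- 2 + 3
subtraction    <- 7 - 4
multiplication <- 6 * 5
division       <- 9 / 2
power          <- 2 ^ 3   # the last operator: 2 raised to the power of 3

c(addition, subtraction, multiplication, division, power)
```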


Vectors 


R is of course more sophisticated than that. We can work with variables or, 
more generally, objects. Objects are defined by using the assignment operator 
<-. To create a variable named x which contains the value 10 type x <- 10 
and click the button Run yet again. The new variable should have appeared in 
the environment pane on the top right. The console however did not show any 
results, because our line of code did not contain any call that creates output. 
When you now type x in the console and hit return, you ask R to show you the 
value of x and the corresponding value should be printed in the console. 


x is a scalar, a vector of length 1. You can easily create longer vectors by using 
the function c() (c is for “concatenate” or “combine”). To create a vector y 
containing the numbers 1 to 5 and print it, do the following. 
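A minimal sketch of these steps, assuming the five numbers are simply listed
inside c(). Output lines follow the #> convention introduced in Chapter 1.

```r
# define the scalar x and print it
x <- 10
x
#> [1] 10

# create the vector y containing the numbers 1 to 5 and print it
y <- c(1, 2, 3, 4, 5)
y
#> [1] 1 2 3 4 5
```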
You can also create a vector of letters or words. For now just remember that
characters have to be surrounded by quotes, else they will be parsed as object
names.


hello <- c("Hello", "World") 


Here we have created a vector of length 2 containing the words Hello and World. 


Do not forget to save your script! To do so, select File, Save. 


Functions 


You have seen the function c() that can be used to combine objects. In general, 
all function calls look the same: a function name is always followed by round 
parentheses. Sometimes, the parentheses include arguments. 


Here are two simple examples. 


# generate the vector `z`
z <- seq(from = 1, to = 5, by = 1)


# compute the mean of the entries in `z`
mean(z)
#> [1] 3


In the first line we use a function called seq() to create the exact same vector
as we did in the previous section, calling it z. The function takes the
arguments from, to and by, which should be self-explanatory. The function mean()
computes the arithmetic mean of its argument x. Since we pass the vector z as
the argument x, the result is 3!


If you are not sure which arguments a function expects, you may consult the
function’s documentation. Let’s say we are not sure how the arguments required
for seq() work. We then type ?seq in the console. After hitting return, the
documentation page for that function pops up in the lower right pane of RStudio.
In there, the section Arguments holds the information we seek. At the bottom
of almost every help page you find examples of how to use the corresponding
functions. This is very helpful for beginners and we recommend looking out for
them.
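The help system can also be reached through ordinary function calls. The sketch
below uses help(), args() and example() from base R. Inspecting seq.default
(the default method behind seq()) is a choice made for this illustration, not
something the text prescribes.

```r
# open the help page of seq(); equivalent to typing ?seq
help("seq")

# show the arguments of the default method of seq()
args(seq.default)

# run the examples listed at the bottom of the help page
example("seq")
```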


Of course, all of the commands presented above also work in interactive widgets
throughout the book. You may try them below.

Chapter 2


Probability Theory 


This chapter reviews some basic concepts of probability theory and demonstrates 
how they can be applied in R. 


Most of the statistical functionalities in base R are collected in the stats
package. It provides simple functions which compute descriptive measures and
facilitate computations involving a variety of probability distributions. It
also contains more sophisticated routines that, e.g., enable the user to
estimate a large number of models based on the same data or help to conduct
extensive simulation studies. stats is part of the base distribution of R,
meaning that it is installed and loaded by default, so there is no need to run
install.packages("stats") or library("stats"). Simply execute
library(help = "stats") in the console to view the documentation and a complete
list of all functions gathered in stats. Most packages come with documentation
that can be viewed within RStudio. It can be invoked using the ? operator,
e.g., upon execution of ?stats the documentation of the stats package is shown
in the help tab of the bottom-right pane.
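To see that stats is indeed available without further ado, one may check the
search path of a fresh session. This is a small sketch; median() is merely one
example of a function that lives in stats.

```r
# stats is attached in a default R session
stats_attached <- "package:stats" %in% search()
stats_attached

# hence its functions can be called right away, e.g., the median
median(c(1, 3, 10))

# overview of everything the package provides
library(help = "stats")
```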


In what follows, we focus on (some of) the probability distributions that
are handled by R and show how to use the relevant functions to solve simple
problems. In doing so, we refresh some core concepts of probability theory.
Among other things, you will learn how to draw random numbers and how to compute
densities, probabilities, quantiles and the like. As we shall see, it is very
convenient to rely on these routines.


2.1 Random Variables and Probability Distributions


Let us briefly review some basic concepts of probability theory.

• The mutually exclusive results of a random process are called the
outcomes. ‘Mutually exclusive’ means that only one of the possible outcomes
can be observed.

• We refer to the probability of an outcome as the proportion of the time
that the outcome occurs in the long run, that is, if the experiment is
repeated many times.

• The set of all possible outcomes of a random variable is called the sample
space.

• An event is a subset of the sample space and consists of one or more
outcomes.


These ideas are unified in the concept of a random variable which is a numerical 
summary of random outcomes. Random variables can be discrete or continuous. 


• Discrete random variables have discrete outcomes, e.g., 0 and 1.
• A continuous random variable may take on a continuum of possible values.
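The distinction can be made concrete with two draws. This is only a sketch:
sample() is discussed in the next subsection, and using runif(), which draws
from a uniform distribution on the interval from 0 to 1, is an assumption made
here for illustration.

```r
# discrete: a draw from the two possible outcomes 0 and 1
d <- sample(0:1, 1)
d

# continuous: a draw from the continuum of values between 0 and 1
u <- runif(1)
u
```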


Probability Distributions of Discrete Random Variables 


A typical example for a discrete random variable D is the result of a dice roll: in 
terms of a random experiment this is nothing but randomly selecting a sample 
of size 1 from a set of numbers which are mutually exclusive outcomes. Here, 
the sample space is {1,2,3,4,5,6} and we can think of many different events, 
e.g., ‘the observed outcome lies between 2 and 5’. 
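Sample space and events translate directly into R vectors. The sketch below
reads ‘between 2 and 5’ as including both endpoints and assumes a fair dice;
the object names are made up for this illustration.

```r
# sample space of a dice roll
sample_space <- 1:6

# the event 'the observed outcome lies between 2 and 5' (endpoints included)
event <- sample_space[sample_space >= 2 & sample_space <= 5]
event

# with equally likely outcomes, the probability of the event is the share
# of outcomes of the sample space it contains
p_event <- length(event) / length(sample_space)
p_event
```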


A basic function to draw random samples from a specified set of elements is the 
function sample(), see ?sample. We can use it to simulate the random outcome 
of a dice roll. Let’s roll the dice! 


sample(1:6, 1) 
#> [1] 4 


The probability distribution of a discrete random variable is the list of all possi- 
ble values of the variable and their probabilities which sum to 1. The cumulative 
probability distribution function gives the probability that the random variable 
is less than or equal to a particular value. 


For the dice roll, the probability distribution and the cumulative probability 
distribution are summarized in Table 2.1. 


Table 2.1: PDF and CDF of a Dice Roll

Outcome                | 1   | 2   | 3   | 4   | 5   | 6
Probability            | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6
Cumulative Probability | 1/6 | 2/6 | 3/6 | 4/6 | 5/6 | 1


We can easily plot both functions using R. Since the probability equals 1/6 for
each outcome, we set up the vector probability by using the function rep()
which replicates a given value a specified number of times.


# generate the vector of probabilities
probability <- rep(1/6, 6)

# plot the probabilities
plot(probability,
     xlab = "outcomes",
     main = "Probability Distribution")


[Figure: Probability Distribution (x-axis: outcomes)]


For the cumulative probability distribution we need the cumulative probabilities, 
i.e., we need the cumulative sums of the vector probability. These sums can 
be computed using cumsum(). 


# generate the vector of cumulative probabilities
cum_probability <- cumsum(probability)

# plot the cumulative probabilities
plot(cum_probability,
     xlab = "outcomes",
     main = "Cumulative Probability Distribution")


[Figure: Cumulative Probability Distribution (x-axis: outcomes)]
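The same two vectors suffice to rebuild Table 2.1. A minimal sketch, assuming
we collect them in a data frame; the name dice_table is made up for this
illustration.

```r
# outcomes of a dice roll and their probabilities
outcome <- 1:6
probability <- rep(1/6, 6)

# collect the probability distribution and its cumulative sums in one table
dice_table <- data.frame(outcome = outcome,
                         probability = probability,
                         cum_probability = cumsum(probability))
dice_table
```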
Bernoulli Trials


The set of elements from which sample() draws outcomes does not have to 
consist of numbers only. We might as well simulate coin tossing with outcomes 
H (heads) and T (tails). 


sample(c("H", "T"), 1)
#> [1] "T"


The result of a single coin toss is a Bernoulli distributed random variable, i.e., 
a variable with two possible distinct outcomes. 


Imagine you are about to toss a coin 10 times in a row and wonder how likely
it is to end up with heads exactly 5 times. This is a typical example of what we
call a Bernoulli experiment as it consists of n = 10 Bernoulli trials that are
independent of each other and we are interested in the likelihood of observing
k = 5 successes H that occur with probability p = 0.5 (assuming a fair coin) in
each trial. Note that the order of the outcomes does not matter here.
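Before turning to the exact result, the experiment can be imitated by
simulation. The sketch below is an illustration under our own assumptions:
10000 repetitions is an arbitrary choice, and set.seed() (explained later in
this chapter) merely fixes the random numbers.

```r
# set seed for reproducibility
set.seed(1)

# one Bernoulli experiment: toss a fair coin 10 times and count heads
tosses <- sample(c("H", "T"), 10, replace = TRUE)
sum(tosses == "H")

# repeat the experiment many times and record the number of heads each time
heads <- replicate(10000,
                   sum(sample(c("H", "T"), 10, replace = TRUE) == "H"))

# relative frequency of observing exactly 5 heads
freq_5 <- mean(heads == 5)
freq_5
```

The relative frequency should settle near the exact probability as the number
of repetitions grows.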


It is a well known result that the number of successes k in a Bernoulli experiment
follows a binomial distribution. We denote this as

$$k \sim B(n, p).$$

The probability of observing k successes in the experiment B(n, p) is given by

$$f(k) = P(k) = \binom{n}{k} \cdot p^k \cdot (1 - p)^{n-k} = \frac{n!}{k!(n-k)!} \cdot p^k \cdot (1 - p)^{n-k}$$

with $\binom{n}{k}$ the binomial coefficient.

In R, we can solve problems like the one stated above by means of the function
dbinom() which calculates P(k|n, p), the probability of the binomial distribution
given the parameters x (k), size (n), and prob (p), see ?dbinom. Let us compute
P(k = 5|n = 10, p = 0.5) (we write this as P(k = 5) for short).


dbinom(x = 5, 
size = 10, 
prob = 0.5) 
#> [1] 0.2460938 


We conclude that P(k = 5), the probability of observing Head k = 5 times when 
tossing a fair coin n = 10 times is about 24.6%. 
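The value returned by dbinom() can be checked against the formula for the
binomial probability. A short sketch using choose() for the binomial
coefficient; the object name by_hand is made up for this illustration.

```r
# parameters of the experiment
n <- 10
k <- 5
p <- 0.5

# binomial probability computed directly from the formula
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)
by_hand

# compare with dbinom(), allowing for floating point error
all.equal(by_hand, dbinom(x = k, size = n, prob = p))
```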


Now assume we are interested in $P(4 \le k \le 7)$, i.e., the probability of
observing 4, 5, 6 or 7 successes for B(10, 0.5). This may be computed by
providing a vector as the argument x in our call of dbinom() and summing up
using sum().


# compute P(4 <= k <= 7) using 'dbinom()'
sum(dbinom(x = 4:7, size = 10, prob = 0.5))
#> [1] 0.7734375


An alternative approach is to use pbinom(), the distribution function of the
binomial distribution, to compute

$$P(4 \le k \le 7) = P(k \le 7) - P(k \le 3).$$


# compute P(4 <= k <= 7) using 'pbinom()'
pbinom(size = 10, prob = 0.5, q = 7) - pbinom(size = 10, prob = 0.5, q = 3)
#> [1] 0.7734375


The probability distribution of a discrete random variable is nothing but a list 
of all possible outcomes that can occur and their respective probabilities. In the 
coin tossing example we have 11 possible outcomes for k. 


# set up vector of possible outcomes
k <- 0:10
k
#>  [1]  0  1  2  3  4  5  6  7  8  9 10


To visualize the probability distribution function of k we may therefore do the
following:


# assign the probabilities
probability <- dbinom(x = k,
                      size = 10,
                      prob = 0.5)

# plot the outcomes against their probabilities
plot(x = k,
     y = probability,
     main = "Probability Distribution Function")


[Figure: Probability Distribution Function]


In a similar fashion we may plot the cumulative distribution function of k by
executing the following code chunk:


# compute cumulative probabilities
prob <- pbinom(q = k,
               size = 10,
               prob = 0.5)

# plot the cumulative probabilities
plot(x = k,
     y = prob,
     main = "Cumulative Distribution Function")


[Figure: Cumulative Distribution Function]


Expected Value, Mean and Variance


The expected value of a random variable is, loosely, the long-run average value of 
its outcomes when the number of repeated trials is large. For a discrete random 
variable, the expected value is computed as a weighted average of its possible 
outcomes whereby the weights are the related probabilities. This is formalized 
in Key Concept 2.1. 


Key Concept 2.1
Expected Value and the Mean


Suppose the random variable $Y$ takes on $k$ possible values, $y_1, \dots, y_k$,
where $y_1$ denotes the first value, $y_2$ denotes the second value, and so
forth, and that the probability that $Y$ takes on $y_1$ is $p_1$, the probability
that $Y$ takes on $y_2$ is $p_2$ and so forth. The expected value of $Y$, $E(Y)$,
is defined as

$$E(Y) = y_1 p_1 + y_2 p_2 + \cdots + y_k p_k = \sum_{i=1}^k y_i p_i$$

where the notation $\sum_{i=1}^k y_i p_i$ means the sum of $y_i p_i$ for $i$
running from 1 to $k$. The expected value of $Y$ is also called the mean of $Y$
or the expectation of $Y$ and is denoted by $\mu_Y$.
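The weighted average in Key Concept 2.1 is a one-liner in R. The sketch below
uses a hypothetical random variable with three possible values; the values and
probabilities are invented for illustration only.

```r
# possible values of a hypothetical discrete random variable Y
y <- c(0, 1, 2)

# their probabilities (these must sum to 1)
p <- c(0.2, 0.5, 0.3)

# expected value: sum of the values weighted by their probabilities
E_Y <- sum(y * p)
E_Y
```

Here the sum is 0 · 0.2 + 1 · 0.5 + 2 · 0.3 = 1.1.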





In the dice example, the random variable, $D$ say, takes on 6 possible values
$d_1 = 1, d_2 = 2, \dots, d_6 = 6$. Assuming a fair dice, each of the 6 outcomes
occurs with a probability of 1/6. It is therefore easy to calculate the exact
value of $E(D)$ by hand:

$$E(D) = \frac{1}{6} \sum_{i=1}^6 d_i = 3.5$$

$E(D)$ is simply the average of the natural numbers from 1 to 6 since all weights
$p_i$ are 1/6. This can be easily calculated using the function mean() which
computes the arithmetic mean of a numeric vector.


# compute mean of natural numbers from 1 to 6
mean(1:6)
#> [1] 3.5


An example of sampling with replacement is rolling a dice three times in a row. 


# set seed for reproducibility 
set.seed(1) 


# rolling a dice three times in a row 
sample(1:6, 3, replace = T) 
#> [1] 1 4 1 


Note that repeated calls of sample(1:6, 3, replace = T) will generally give
different outcomes since we draw at random. To allow you to reproduce the
results of computations that involve random numbers, we will use set.seed()
to set R’s random number generator to a specific state. You should check that
it actually works: set the seed in your R session to 1 and verify that you obtain
the same three random numbers!
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Sequences of random numbers generated by R are pseudo-random num- 
bers, i.e., they are not “truly” random but approximate the properties 
of sequences of random numbers. Since this approximation is good 
enough for our purposes we refer to pseudo-random numbers as random 
numbers throughout this book. 

In general, sequences of random numbers are generated by functions 
called “pseudo-random number generators” (PRNGs). The PRNG in R 
works by performing some operation on a deterministic value. Generally, 
this value is the previous number generated by the PRNG. However, the 
first time the PRNG is used, there is no previous value. A “seed” is the 
first value of a sequence of numbers — it initializes the sequence. Each 
seed value will correspond to a different sequence of values. In R a seed 
can be set using set.seed(). 

This is convenient for us: 

If we provide the same seed twice, we get the same sequence of numbers 
twice. Thus, setting a seed before executing R code which involves 
random numbers makes the outcome reproducible! 
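
The reproducibility claim can be checked directly. The following is a minimal sketch; the variable names first_draw, second_draw and third_draw are chosen here for illustration and do not appear elsewhere in the book.

```r
# draw three dice rolls twice, resetting the seed before each draw
set.seed(1)
first_draw <- sample(1:6, 3, replace = TRUE)

set.seed(1)
second_draw <- sample(1:6, 3, replace = TRUE)

# same seed, same sequence of numbers
identical(first_draw, second_draw)

# without resetting the seed, the generator simply continues its sequence,
# so this draw is not guaranteed to repeat the first one
third_draw <- sample(1:6, 3, replace = TRUE)
third_draw
```

Resetting the seed restarts the generator from the same state, which is why the first two draws coincide.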











Of course we could also consider a much bigger number of trials, 10000 say. 
Doing so, it would be pointless to simply print the results to the console: by 
default R displays up to 1000 entries of large vectors and omits the remainder 
(give it a try). Eyeballing the numbers does not reveal much. Instead, let us 
calculate the sample average of the outcomes using mean() and see if the result 
comes close to the expected value E(D) = 3.5. 


# set seed for reproducibility 
set.seed(1) 


# compute the sample mean of 10000 dice rolls 
mean(sample(1:6,
            10000,
            replace = T))
#> [1] 3.5138 


We find the sample mean to be fairly close to the expected value. This result 
will be discussed in Chapter 2.2 in more detail. 


Other frequently encountered measures are the variance and the standard devi- 
ation. Both are measures of the dispersion of a random variable. 
Key Concept 2.2
Variance and Standard Deviation
The variance of the discrete random variable Y, denoted $\sigma_Y^2$, is

$$\sigma_Y^2 = \text{Var}(Y) = E\left[(Y - \mu_Y)^2\right] = \sum_{i=1}^k (y_i - \mu_Y)^2 p_i$$

The standard deviation of Y is $\sigma_Y$, the square root of the variance. The
units of the standard deviation are the same as the units of Y.





The variance as defined in Key Concept 2.2, being a population quantity, is
not implemented as a function in R. Instead we have the function var() which
computes the sample variance

$$s_Y^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \overline{y})^2.$$

Remember that $s_Y^2$ is different from the so-called population variance of a
discrete random variable Y,

$$\text{Var}(Y) = \frac{1}{N} \sum_{i=1}^N (y_i - \mu_Y)^2$$

since it measures how the n observations in the sample are dispersed around the
sample average $\overline{y}$. Instead, Var(Y) measures the dispersion of the
whole population (N members) around the population mean $\mu_Y$. The difference
becomes clear when we look at our dice rolling example. For D we have

$$\text{Var}(D) = 1/6 \sum_{i=1}^6 (d_i - 3.5)^2 = 2.92$$

which is obviously different from the result of $s^2$ as computed by var().


var(1:6)
#> [1] 3.5


The sample variance as computed by var() is an estimator of the population 
variance. You may check this using the widget below. 
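
The difference between the two quantities can also be reproduced in a few lines. This is a small sketch that treats the six faces as a sample of size 6 for var(); the names d, p, mu_D, pop_var and samp_var are illustrative assumptions.

```r
# outcomes and probabilities of a fair dice
d <- 1:6
p <- rep(1/6, 6)

# population mean and population variance (probability-weighted)
mu_D <- sum(d * p)
pop_var <- sum((d - mu_D)^2 * p)

# sample variance of the vector 1:6, which divides by n - 1
n <- length(d)
samp_var <- var(d)

pop_var
samp_var

# rescaling by (n - 1) / n recovers the population formula
samp_var * (n - 1) / n
```

The two results differ only by the factor (n - 1)/n, which becomes negligible as n grows.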
Probability Distributions of Continuous Random Variables


Since a continuous random variable takes on a continuum of possible values, we 
cannot use the concept of a probability distribution as used for discrete random 
variables. Instead, the probability distribution of a continuous random variable 
is summarized by its probability density function (PDF). 


The cumulative probability distribution function (CDF) for a continuous random
variable is defined just as in the discrete case. Hence, the CDF of a continuous
random variable states the probability that the random variable is less
than or equal to a particular value.


For completeness, we present revisions of Key Concepts 2.1 and 2.2 for the 
continuous case. 


Key Concept 2.3 
Probabilities, Expected Value and Variance of a Continuous 
Random Variable 


Let $f_Y(y)$ denote the probability density function of Y. The probability
that Y falls between a and b where a < b is

$$P(a \leq Y \leq b) = \int_a^b f_Y(y) \, dy.$$

We further have that $P(-\infty \leq Y \leq \infty) = 1$ and therefore

$$\int_{-\infty}^{\infty} f_Y(y) \, dy = 1.$$

As for the discrete case, the expected value of Y is the probability
weighted average of its values. Due to continuity, we use integrals instead
of sums. The expected value of Y is defined as

$$E(Y) = \mu_Y = \int y f_Y(y) \, dy.$$

The variance is the expected value of $(Y - \mu_Y)^2$. We thus have

$$\text{Var}(Y) = \sigma_Y^2 = \int (y - \mu_Y)^2 f_Y(y) \, dy.$$





Let us discuss an example: 


Consider the continuous random variable X with PDF 


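$$f_X(x) = \frac{3}{x^4}, \quad x > 1.$$

• We can show analytically that the integral of $f_X(x)$ over the real line
equals 1.

$$\begin{align}
\int f_X(x) \, dx &= \int_1^\infty \frac{3}{x^4} \, dx \tag{2.1} \\
&= \lim_{t \to \infty} \int_1^t \frac{3}{x^4} \, dx \tag{2.2} \\
&= \lim_{t \to \infty} -x^{-3} \Big|_{x=1}^{t} \tag{2.3} \\
&= -\left( \lim_{t \to \infty} \frac{1}{t^3} - 1 \right) \tag{2.4} \\
&= 1 \tag{2.5}
\end{align}$$

• The expectation of X can be computed as follows:

$$\begin{align}
E(X) = \int x \cdot f_X(x) \, dx &= \int_1^\infty x \cdot \frac{3}{x^4} \, dx \tag{2.6} \\
&= -\frac{3}{2} x^{-2} \Big|_{x=1}^{\infty} \tag{2.7} \\
&= -\frac{3}{2} \left( \lim_{t \to \infty} \frac{1}{t^2} - 1 \right) \tag{2.8} \\
&= \frac{3}{2} \tag{2.9}
\end{align}$$

• Note that the variance of X can be expressed as $\text{Var}(X) = E(X^2) - E(X)^2$.
Since E(X) has been computed in the previous step, we seek $E(X^2)$:

$$\begin{align}
E(X^2) = \int x^2 \cdot f_X(x) \, dx &= \int_1^\infty x^2 \cdot \frac{3}{x^4} \, dx \tag{2.10} \\
&= -3 x^{-1} \Big|_{x=1}^{\infty} \tag{2.11} \\
&= -3 \left( \lim_{t \to \infty} \frac{1}{t} - 1 \right) \tag{2.12} \\
&= 3 \tag{2.13}
\end{align}$$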


So we have shown that the area under the curve equals one, that the expectation
is $E(X) = \frac{3}{2}$ and we found the variance to be
$\text{Var}(X) = 3 - \left(\frac{3}{2}\right)^2 = \frac{3}{4}$. However, this was
tedious and, as we shall see, an analytic approach is not applicable for some
PDFs, e.g., if integrals have no closed form solutions.


Luckily, R also enables us to easily find the results derived above. The tool we
use for this is the function integrate(). First, we have to define the functions
we want to calculate integrals for as R functions, i.e., the PDF $f_X(x)$ as well
as the expressions $x \cdot f_X(x)$ and $x^2 \cdot f_X(x)$.
# define functions
f <- function(x) 3 / x^4
g <- function(x) x * f(x)
h <- function(x) x^2 * f(x)


Next, we use integrate() and set lower and upper limits of integration to 1 and
$\infty$ using arguments lower and upper. By default, integrate() prints the
result along with an estimate of the approximation error to the console. However,
the outcome is not a numeric value one can readily do further calculation with. In
order to get only a numeric value of the integral, we need to use the $ operator
in conjunction with value. The $ operator is used to extract elements by name
from an object of type list.


# compute area under the density curve
area <- integrate(f,
                  lower = 1,
                  upper = Inf)$value
area
#> [1] 1

# compute E(X)
EX <- integrate(g,
                lower = 1,
                upper = Inf)$value
EX
#> [1] 1.5

# compute Var(X)
VarX <- integrate(h,
                  lower = 1,
                  upper = Inf)$value - EX^2
VarX
#> [1] 0.75


Although there is a wide variety of distributions, the ones most often encountered
in econometrics are the normal, chi-squared, Student t and F distributions.
Therefore we will discuss some core R functions that allow us to do calculations
involving densities, probabilities and quantiles of these distributions.


Every probability distribution that R handles has four basic functions whose 
names consist of a prefix followed by a root name. As an example, take the 
normal distribution. The root name of all four functions associated with the 
normal distribution is norm. The four prefixes are 


• d for “density” - probability function / probability density function

• p for “probability” - cumulative distribution function

• q for “quantile” - quantile function (inverse cumulative distribution
function)

• r for “random” - random number generator


Thus, for the normal distribution we have the R functions dnorm(), pnorm(), 
qnorm() and rnorm(). 
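
As a quick orientation, the four functions can be called side by side. This is a minimal sketch for the standard normal distribution; the object names dens, prob, quant and draws are illustrative only.

```r
# d: density of N(0,1) at 0
dens <- dnorm(0)

# p: cumulative probability P(Z <= 1.96)
prob <- pnorm(1.96)

# q: the 97.5% quantile, i.e., the inverse of the CDF
quant <- qnorm(0.975)

# r: five random draws from N(0,1)
set.seed(1)
draws <- rnorm(5)

# q and p are inverses of each other (up to numerical precision)
qnorm(pnorm(1.96))
```

The same prefix logic carries over to the other distributions discussed below, e.g. dchisq(), pt() or qf().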


The Normal Distribution 


Probably the most important probability distribution considered here is the
normal distribution. This is not least due to the special role of the standard
normal distribution and the Central Limit Theorem which is to be treated shortly.
Normal distributions are symmetric and bell-shaped. A normal distribution is
characterized by its mean $\mu$ and its standard deviation $\sigma$, concisely
expressed by $N(\mu, \sigma^2)$. The normal distribution has the PDF

$$f(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \tag{2.14}$$


For the standard normal distribution we have $\mu = 0$ and $\sigma = 1$. Standard
normal variates are often denoted by Z. Usually, the standard normal PDF is
denoted by $\phi$ and the standard normal CDF is denoted by $\Phi$. Hence,

$$\phi(c) = \Phi'(c), \quad \Phi(c) = P(Z \leq c), \quad Z \sim N(0,1).$$


Note that the notation X ~ Y reads as “X is distributed as Y”. In R, we can con- 
veniently obtain densities of normal distributions using the function dnorm(). 
Let us draw a plot of the standard normal density function using curve() to- 
gether with dnorm(). 


# draw a plot of the N(0,1) PDF 
curve (dnorm(x) , 
xlim = c(-3.5, 3.5), 
ylab = "Density", 
main = "Standard Normal Density Function") 
[Figure: Standard Normal Density Function]

We can obtain the density at different positions by passing a vector to dnorm().


# compute density at x=-1.96, x=0 and x=1.96
dnorm(x = c(-1.96, 0, 1.96)) 
#> [1] 0.05844094 0.39894228 0.05844094 


Similar to the PDF, we can plot the standard normal CDF using curve(). We 
could use dnorm() for this but it is much more convenient to rely on pnorm(). 


# plot the standard normal CDF 
curve(pnorm(x),
      xlim = c(-3.5, 3.5),
ylab = "Probability", 
main = "Standard Normal Cumulative Distribution Function") 


[Figure: Standard Normal Cumulative Distribution Function]

We can also use R to calculate the probability of events associated with a
standard normal variate.


Let us say we are interested in $P(Z \leq 1.337)$. For some continuous random
variable Z on $(-\infty, \infty)$ with density g(x) we would have to determine
G(x), the anti-derivative of g(x), so that

$$P(Z \leq 1.337) = G(1.337) = \int_{-\infty}^{1.337} g(x) \, dx.$$

If $Z \sim N(0,1)$, we have $g(x) = \phi(x)$. There is no analytic solution to the
integral above. Fortunately, R offers good approximations. The first approach
makes use of the function integrate() which solves one-dimensional
integration problems using a numerical method. For this, we first define the
function we want to compute the integral of as an R function f. In our example,
f is the standard normal density function and hence takes a single argument x.
Following the definition of $\phi(x)$ we define f as


# define the standard normal PDF as an R function 
f <- function(x) {
  1/(sqrt(2 * pi)) * exp(-0.5 * x^2)
}


Let us check if this function computes standard normal densities by passing a 
vector. 


# define a vector of reals 
quants <- c(-1.96, 0, 1.96)


# compute densities 
f (quants) 
#> [1] 0.05844094 0.39894228 0.05844094 


# compare to the results produced by 'dnorm()' 
f (quants) == dnorm(quants) 
#> [1] TRUE TRUE TRUE 


The results produced by f() are indeed equivalent to those given by dnorm().
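
A side note on the comparison: testing floating point numbers with == checks for exact equality, which happens to hold here but can fail for results that differ only by rounding error. A more tolerant check, sketched below under the assumption that f and quants are defined as above, uses all.equal(); the name same is illustrative.

```r
# the standard normal PDF and the evaluation points, as defined in the text
f <- function(x) 1 / sqrt(2 * pi) * exp(-0.5 * x^2)
quants <- c(-1.96, 0, 1.96)

# compare up to a small numerical tolerance instead of exact equality
same <- isTRUE(all.equal(f(quants), dnorm(quants)))
same
```

all.equal() returns TRUE when the two vectors agree within its default tolerance and a description of the difference otherwise, hence the wrapper isTRUE().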


Next, we call integrate() on f() and specify the arguments lower and upper,
the lower and upper limits of integration.


# integrate f()
integrate(f,
lower = -Inf, 
upper = 1.337) 
#> 0.9093887 with absolute error < 1.7e-07 
We find that the probability of observing $Z \leq 1.337$ is about 90.94%.


A second and much more convenient way is to use the function pnorm(), the 
standard normal cumulative distribution function. 


# compute the probability using pnorm() 
pnorm(1.337) 
#> [1] 0.9093887 


The result matches the outcome of the approach using integrate(). 
Let us discuss some further examples: 


A commonly known result is that 95% probability mass of a standard normal lies
in the interval [−1.96, 1.96], that is, in a distance of about 2 standard deviations
to the mean. We can easily confirm this by calculating

$$P(-1.96 \leq Z \leq 1.96) = 1 - 2 \times P(Z \leq -1.96)$$

due to symmetry of the standard normal PDF. Thanks to R, we can abandon the
table of the standard normal CDF found in many other textbooks and instead
solve this fast by using pnorm().


# compute the probability 
1 - 2 * (pnorm(-1.96)) 
#> [1] 0.9500042 


To make statements about the probability of observing outcomes of Y in some 
specific range it is more convenient when we standardize first as shown in Key 
Concept 2.4. 


Key Concept 2.4 
Computing Probabilities Involving Normal Random Variables 


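Suppose Y is normally distributed with mean $\mu$ and variance $\sigma^2$:

$$Y \sim N(\mu, \sigma^2)$$

Then Y is standardized by subtracting its mean and dividing by its
standard deviation:

$$Z = \frac{Y - \mu}{\sigma}$$

Let $c_1$ and $c_2$ denote two numbers whereby $c_1 < c_2$ and further
$d_1 = (c_1 - \mu)/\sigma$ and $d_2 = (c_2 - \mu)/\sigma$. Then

$$\begin{align}
P(Y \leq c_2) &= P(Z \leq d_2) = \Phi(d_2), \\
P(Y \geq c_1) &= P(Z \geq d_1) = 1 - \Phi(d_1), \\
P(c_1 \leq Y \leq c_2) &= P(d_1 \leq Z \leq d_2) = \Phi(d_2) - \Phi(d_1).
\end{align}$$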
Now consider a random variable Y with $Y \sim N(5, 25)$. R functions that handle
the normal distribution can perform the standardization. If we are interested
in $P(3 \leq Y \leq 4)$ we can use pnorm() and adjust for a mean and/or a standard
deviation that deviate from $\mu = 0$ and $\sigma = 1$ by specifying the arguments
mean and sd accordingly. Attention: the argument sd requires the standard
deviation, not the variance!


pnorm(4, mean = 5, sd = 5) - pnorm(3, mean = 5, sd = 5) 
#> [1] 0.07616203
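
To see that pnorm() with mean and sd does the same job as the manual standardization of Key Concept 2.4, both routes can be compared. This is a small sketch; the names mu, sigma, c1, c2, d1, d2, p_std and p_direct are chosen here for illustration.

```r
# parameters of Y ~ N(5, 25): note that sigma is the standard deviation
mu <- 5
sigma <- sqrt(25)

# interval bounds and their standardized counterparts
c1 <- 3
c2 <- 4
d1 <- (c1 - mu) / sigma
d2 <- (c2 - mu) / sigma

# route 1: standardize by hand and use the standard normal CDF
p_std <- pnorm(d2) - pnorm(d1)

# route 2: let pnorm() handle mean and standard deviation
p_direct <- pnorm(c2, mean = mu, sd = sigma) - pnorm(c1, mean = mu, sd = sigma)

c(standardized = p_std, direct = p_direct)
```

Both routes give the same probability, which is why the standardization step can be left to R.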


An extension of the normal distribution in a univariate setting is the multivariate 
normal distribution. The joint PDF of two random normal variables X and Y 
is given by 


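$$g_{X,Y}(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho_{XY}^2}} \cdot \exp\left\{ \frac{1}{-2(1-\rho_{XY}^2)} \left[ \left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho_{XY}\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 \right] \right\}. \tag{2.15}$$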


Equation (2.15) contains the bivariate normal PDF. It is somewhat hard to gain
insights from this complicated expression. Instead, let us consider the special
case where X and Y are uncorrelated standard normal random variables with
densities $f_X(x)$ and $f_Y(y)$ with joint normal distribution. We then have the
parameters $\sigma_X = \sigma_Y = 1$, $\mu_X = \mu_Y = 0$ (due to marginal standard
normality) and $\rho_{XY} = 0$ (due to independence). The joint density of X and Y
then becomes

$$g_{X,Y}(x,y) = f_X(x) f_Y(y) = \frac{1}{2\pi} \cdot \exp\left\{ -\frac{1}{2} \left[ x^2 + y^2 \right] \right\}, \tag{2.2}$$

the PDF of the bivariate standard normal distribution. The widget below provides
an interactive three-dimensional plot of (2.2).


By moving the cursor over the plot you can see that the density is rotationally
invariant, i.e., the density at (a, b) solely depends on the distance of (a, b) to
the origin: geometrically, regions of equal density are edges of concentric
circles in the XY-plane, centered at $(\mu_X = 0, \mu_Y = 0)$.


The normal distribution has some remarkable characteristics. For example, for
two jointly normally distributed variables X and Y, the conditional expectation
function is linear: one can show that

$$E(Y \vert X) = E(Y) + \rho \frac{\sigma_Y}{\sigma_X} (X - E(X)).$$
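
The linearity of the conditional expectation function can be illustrated by simulation. The sketch below is an assumption-laden illustration, not the book's own code: it constructs a standard bivariate normal pair with correlation rho = 0.6 from two independent standard normals and fits a linear regression with lm(). In this standardized case the formula reduces to E(Y|X) = rho · X.

```r
# simulate standard bivariate normal data with correlation rho
set.seed(1)
n <- 100000
rho <- 0.6

x <- rnorm(n)
# y is standard normal and has correlation rho with x by construction
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# a linear fit should recover intercept E(Y) = 0 and slope rho * 1 / 1
fit <- lm(y ~ x)
est <- unname(coef(fit))
est
```

The estimated intercept should be close to 0 and the estimated slope close to rho, up to sampling noise.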
The interactive widget below shows standard bivariate normally distributed
sample data along with the conditional expectation function E(Y|X) and the 
marginal densities of X and Y. All elements adjust accordingly as you vary the 
parameters. 


This interactive part of the book is only available in the HTML version. 


The Chi-Squared Distribution 


The chi-squared distribution is another distribution relevant in econometrics. It 
is often needed when testing special types of hypotheses frequently encountered 
when dealing with regression models. 


The sum of M squared independent standard normal distributed random vari- 
ables follows a chi-squared distribution with M degrees of freedom: 


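$$Z_1^2 + \dots + Z_M^2 = \sum_{m=1}^M Z_m^2 \sim \chi^2_M \quad \text{with} \quad Z_m \overset{i.i.d.}{\sim} N(0,1)$$

A $\chi^2$ distributed random variable with M degrees of freedom has expectation
M, mode at M − 2 for $M \geq 2$ and variance 2 · M. For example, for

$$Z_1, Z_2, Z_3 \overset{i.i.d.}{\sim} N(0,1)$$

it holds that

$$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3. \tag{2.3}$$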
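
The stated moments can be checked numerically. The following sketch draws from the $\chi^2_3$ distribution with rchisq() and locates the mode of the density with optimize(); the names M, draws, sim_mean, sim_var and mode_M are illustrative assumptions.

```r
# degrees of freedom
M <- 3

# simulate draws and compare their moments to M and 2 * M
set.seed(1)
draws <- rchisq(100000, df = M)
sim_mean <- mean(draws)
sim_var <- var(draws)
c(mean = sim_mean, variance = sim_var)

# locate the mode of the density numerically; theory says M - 2
mode_M <- optimize(function(x) dchisq(x, df = M),
                   interval = c(0, 10),
                   maximum = TRUE)$maximum
mode_M
```

The simulated mean and variance should be close to 3 and 6, and the numerically found mode close to 1.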


Using the code below, we can display the PDF and the CDF of a $\chi^2_3$ random
variable in a single plot. This is achieved by setting the argument add = TRUE in
the second call of curve(). Further we adjust limits of both axes using xlim and
ylim and choose different colors to make both functions better distinguishable.
The plot is completed by adding a legend with help of legend().


# plot the PDF
curve(dchisq(x, df = 3),
      xlim = c(0, 10),
      ylim = c(0, 1),
      col = "blue",
      ylab = "",
      main = "p.d.f. and c.d.f of Chi-Squared Distribution, M = 3")

# add the CDF to the plot
curve(pchisq(x, df = 3),
      xlim = c(0, 10),
      add = TRUE,
      col = "red")

# add a legend to the plot
legend("topleft",
       c("PDF", "CDF"),
       col = c("blue", "red"),
       lty = c(1, 1))
[Figure: p.d.f. and c.d.f of Chi-Squared Distribution, M = 3]

Since the outcomes of a $\chi^2_M$ distributed random variable are always positive,
the support of the related PDF and CDF is $\mathbb{R}_{\geq 0}$.


As expectation and variance depend (solely!) on the degrees of freedom, the 
distribution’s shape changes drastically if we vary the number of squared stan- 
dard normals that are summed up. This relation is often depicted by overlaying 
densities for different M, see the Wikipedia Article. 


We reproduce this here by plotting the density of the $\chi^2_1$ distribution on
the interval [0, 15] with curve(). In the next step, we loop over degrees of
freedom M = 2, ..., 7 and add a density curve for each M to the plot. We also
adjust the line color for each iteration of the loop by setting col = M. At last,
we add a legend that displays degrees of freedom and the associated colors.


# plot the density for M=1
curve(dchisq(x, df = 1),
      xlim = c(0, 15),
      xlab = "x",
      ylab = "Density",
      main = "Chi-Square Distributed Random Variables")

# add densities for M=2,...,7 to the plot using a 'for()' loop
for (M in 2:7) {
  curve(dchisq(x, df = M),
        xlim = c(0, 15),
        add = T,
        col = M)
}

# add a legend
legend("topright",
       as.character(1:7),
       col = 1:7,
       lty = 1)





[Figure: Chi-Square Distributed Random Variables]














Increasing the degrees of freedom shifts the distribution to the right (the mode 
becomes larger) and increases the dispersion (the distribution’s variance grows). 


The Student t Distribution 


Let Z be a standard normal variate, W a $\chi^2_M$ random variable and further
assume that Z and W are independent. Then it holds that

$$\frac{Z}{\sqrt{W/M}} =: X \sim t_M$$

and X follows a Student t distribution (or simply t distribution) with M degrees
of freedom.


Similar to the $\chi^2_M$ distribution, the shape of a $t_M$ distribution depends
on M. t distributions are symmetric, bell-shaped and look similar to a normal
distribution, especially when M is large. This is not a coincidence: for a
sufficiently large M, the $t_M$ distribution can be approximated by the standard
normal distribution. This approximation works reasonably well for $M \geq 30$. As
we will illustrate later by means of a small simulation study, the $t_\infty$
distribution is the standard normal distribution.

A $t_M$ distributed random variable X has an expectation if M > 1 and it has a
variance if M > 2.

$$E(X) = 0, \quad M > 1 \tag{2.16}$$

$$\text{Var}(X) = \frac{M}{M-2}, \quad M > 2 \tag{2.17}$$


Let us plot some t distributions with different M and compare them to the 
standard normal distribution. 


# plot the standard normal density
curve(dnorm(x),
      xlim = c(-4, 4),
      xlab = "x",
      lty = 2,
      ylab = "Density",
      main = "Densities of t Distributions")

# plot the t density for M=2
curve(dt(x, df = 2),
      xlim = c(-4, 4),
      col = 2,
      add = T)

# plot the t density for M=4
curve(dt(x, df = 4),
      xlim = c(-4, 4),
      col = 3,
      add = T)

# plot the t density for M=25
curve(dt(x, df = 25),
      xlim = c(-4, 4),
      col = 4,
      add = T)

# add a legend
legend("topright",
       c("N(0, 1)", "M=2", "M=4", "M=25"),
       col = 1:4,
       lty = c(2, 1, 1, 1))


[Figure: Densities of t Distributions]

The plot illustrates what has been said in the previous paragraph: as the degrees
of freedom increase, the shape of the t distribution comes closer to that of a 
standard normal bell curve. Already for M = 25 we find little difference to the 
standard normal density. If M is small, we find the distribution to have heavier 
tails than a standard normal, i.e., it has a “fatter” bell shape. 
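
The heavier tails can also be expressed in numbers. The sketch below compares the probability mass outside [−1.96, 1.96] under several t distributions with that of the standard normal; the degrees of freedom in dfs and the names tail_t and tail_norm are choices made for this illustration.

```r
# degrees of freedom to compare
dfs <- c(2, 4, 25, 1000)

# two-sided tail probability P(|X| > 1.96) under t_M, using symmetry
tail_t <- 2 * pt(-1.96, df = dfs)

# the same tail probability under the standard normal distribution
tail_norm <- 2 * pnorm(-1.96)

# tail mass shrinks toward the normal value as M increases
rbind(df = dfs, t_tail = tail_t, normal_tail = tail_norm)
```

Every t distribution puts more mass in the tails than the standard normal, and the gap closes as M grows.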


The F Distribution 


Another ratio of random variables important to econometricians is the ratio
of two independent $\chi^2$ distributed random variables that are divided by their
degrees of freedom M and n. The quantity

$$\frac{W/M}{V/n} \sim F_{M,n} \quad \text{with} \quad W \sim \chi^2_M, \ V \sim \chi^2_n$$

follows an F distribution with numerator degrees of freedom M and denominator
degrees of freedom n, denoted $F_{M,n}$. The distribution was first derived by
George Snedecor but was named in honor of Sir Ronald Fisher.

By definition, the support of both PDF and CDF of an $F_{M,n}$ distributed random
variable is $\mathbb{R}_{\geq 0}$.
Say we have an F distributed random variable Y with numerator degrees of
freedom 3 and denominator degrees of freedom 14 and are interested in P(Y ≥ 2).
This can be computed with help of the function pf(). By setting the
argument lower.tail to FALSE we ensure that R computes 1 − P(Y ≤ 2),
i.e., the probability mass in the tail right of 2.


pf(2, df1 = 3, df2 = 14, lower.tail = F)
#> [1] 0.1603538 


We can visualize this probability by drawing a line plot of the related density 
and adding a color shading with polygon(). 


# define coordinate vectors for vertices of the polygon
x <- c(2, seq(2, 10, 0.01), 10)
y <- c(0, df(seq(2, 10, 0.01), 3, 14), 0)

# draw density of F_{3, 14}
curve(df(x, 3, 14),
      ylim = c(0, 0.8),
      xlim = c(0, 10),
      ylab = "Density",
      main = "Density Function")

# draw the polygon
polygon(x, y, col = "orange")


[Figure: Density Function]

The F distribution is related to many other distributions. An important special
case encountered in econometrics arises if the denominator degrees of freedom
are large such that the $F_{M,n}$ distribution can be approximated by the
$F_{M,\infty}$ distribution which turns out to be simply the distribution of a
$\chi^2_M$ random variable divided by its degrees of freedom M,

$$W/M \sim F_{M,\infty}, \quad W \sim \chi^2_M.$$
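
This approximation can be checked numerically. The sketch below uses a large but finite denominator degrees of freedom (1e6, an arbitrary choice) as a stand-in for infinity and compares the CDF of $F_{M,n}$ at q with the CDF of $\chi^2_M$ at q · M; the names M, q, p_F and p_chi are illustrative.

```r
# numerator degrees of freedom and evaluation point
M <- 3
q <- 2

# P(Y <= q) for Y ~ F_{M, n} with a very large n
p_F <- pf(q, df1 = M, df2 = 1e6)

# P(W / M <= q) = P(W <= q * M) for W ~ chi-squared with M degrees of freedom
p_chi <- pchisq(q * M, df = M)

c(F = p_F, chi2_scaled = p_chi)
```

The two probabilities should agree to several decimal places.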


2.2 Random Sampling and the Distribution of 
Sample Averages 


To clarify the basic idea of random sampling, let us jump back to the dice rolling 
example: 


Suppose we are rolling the dice n times. This means we are interested in the
outcomes of random $Y_i$, i = 1, ..., n which are characterized by the same
distribution. Since these outcomes are selected randomly, they are random variables
themselves and their realizations will differ each time we draw a sample, i.e.,
each time we roll the dice n times. Furthermore, each observation is randomly
drawn from the same population, that is, the numbers from 1 to 6, and their
individual distribution is the same. Hence $Y_1, \dots, Y_n$ are identically
distributed.

Moreover, we know that the value of any of the $Y_i$ does not provide any
information on the remainder of the outcomes. In our example, rolling a six as the
first observation in our sample does not alter the distributions of
$Y_2, \dots, Y_n$: all numbers are equally likely to occur. This means that all
$Y_i$ are also independently distributed. Thus $Y_1, \dots, Y_n$ are independently
and identically distributed (i.i.d.). The dice example uses this most simple
sampling scheme. That is why it is called simple random sampling. This concept is
summarized in Key Concept 2.5.


Key Concept 2.5 
Simple Random Sampling and i.i.d. Random Variables 


In simple random sampling, n objects are drawn at random from a population.
Each object is equally likely to end up in the sample. We denote
the value of the random variable Y for the i-th randomly drawn object as
$Y_i$. Since all objects are equally likely to be drawn and the distribution of
$Y_i$ is the same for all i, the $Y_1, \dots, Y_n$ are independently and identically
distributed (i.i.d.). This means the distribution of $Y_i$ is the same for all
i = 1, ..., n and $Y_1$ is distributed independently of $Y_2, \dots, Y_n$ and $Y_2$
is distributed independently of $Y_1, Y_3, \dots, Y_n$ and so forth.





What happens if we consider functions of the sample data? Consider the example
of rolling a dice two times in a row once again. A sample now consists
of two independent random draws from the set {1, 2, 3, 4, 5, 6}. It is apparent
that any function of these two random variables, e.g. their sum, is also random.

Convince yourself by executing the code below several times.


sum(sample(1:6, 2, replace = T))
#> [1] 9


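Clearly, this sum, let us call it S, is a random variable as it depends on randomly
drawn summands. For this example, we can completely enumerate all outcomes
and hence write down the theoretical probability distribution of our function of
the sample data S:

We face $6^2 = 36$ possible pairs. Those pairs are

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

Thus, possible outcomes for S are

{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.

Enumeration of outcomes yields

$$P(S) = \begin{cases}
1/36, & S = 2 \\
2/36, & S = 3 \\
3/36, & S = 4 \\
4/36, & S = 5 \\
5/36, & S = 6 \\
6/36, & S = 7 \\
5/36, & S = 8 \\
4/36, & S = 9 \\
3/36, & S = 10 \\
2/36, & S = 11 \\
1/36, & S = 12
\end{cases} \tag{2.18}$$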


We can also compute E(S) and Var(S) as stated in Key Concept 2.1 and Key
Concept 2.2.
# Vector of outcomes
S <- 2:12

# Vector of probabilities
PS <- c(1:6, 5:1) / 36

# Expectation of S
ES <- sum(S * PS)
ES
#> [1] 7

# Variance of S
VarS <- sum((S - c(ES))^2 * PS)
VarS
#> [1] 5.833333
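
Instead of typing the probabilities by hand, the enumeration itself can be left to R. The sketch below builds all 36 sums with outer() and tabulates them; the names sums, PS_enum and ES_enum are assumptions made for this illustration.

```r
# 6 x 6 matrix holding the sum for every pair of outcomes
sums <- outer(1:6, 1:6, FUN = "+")

# relative frequency of each possible sum among the 36 equally likely pairs
PS_enum <- table(sums) / length(sums)
PS_enum

# expectation of S computed from the enumerated distribution
ES_enum <- sum(as.numeric(names(PS_enum)) * as.numeric(PS_enum))
ES_enum
```

The tabulated probabilities coincide with the vector c(1:6, 5:1) / 36 used above.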


So the distribution of S is known. It is also evident that its distribution differs
considerably from the marginal distribution, i.e., the distribution of a single dice
roll's outcome, D. Let us visualize this using bar plots.


# divide the plotting area into one row with two columns
par(mfrow = c(1, 2))

# plot the distribution of S
barplot(PS,
        ylim = c(0, 0.2),
        xlab = "S",
        ylab = "Probability",
        col = "steelblue",
        space = 0,
        main = "Sum of Two Dice Rolls")

# plot the distribution of D
probability <- rep(1/6, 6)
names(probability) <- 1:6

barplot(probability,
        ylim = c(0, 0.2),
        xlab = "D",
        col = "steelblue",
        space = 0,
        main = "Outcome of a Single Dice Roll")
Many econometric procedures deal with averages of sampled data. It is typically
assumed that observations are drawn randomly from a larger, unknown
population. As demonstrated for the sample function S, computing an average 
of a random sample has the effect that the average is a random variable itself. 
This random variable in turn has a probability distribution, called the sam- 
pling distribution. Knowledge about the sampling distribution of the average is 
therefore crucial for understanding the performance of econometric procedures. 





The sample average of a sample of n observations $Y_1, \dots, Y_n$ is

$$\overline{Y} = \frac{1}{n} \sum_{i=1}^n Y_i = \frac{1}{n} (Y_1 + Y_2 + \cdots + Y_n).$$

$\overline{Y}$ is also called the sample mean.


Mean and Variance of the Sample Mean 


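Suppose that $Y_1, \dots, Y_n$ are i.i.d. and denote $\mu_Y$ and $\sigma_Y^2$ as the
mean and the variance of the $Y_i$. Then we have that

$$E(\overline{Y}) = E\left( \frac{1}{n} \sum_{i=1}^n Y_i \right) = \frac{1}{n} \sum_{i=1}^n E(Y_i) = \frac{1}{n} \cdot n \cdot \mu_Y = \mu_Y$$

and

$$\begin{align}
\text{Var}(\overline{Y}) &= \text{Var}\left( \frac{1}{n} \sum_{i=1}^n Y_i \right) \\
&= \frac{1}{n^2} \sum_{i=1}^n \text{Var}(Y_i) + \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1, j \neq i}^n \text{cov}(Y_i, Y_j) \\
&= \frac{\sigma_Y^2}{n} \\
&= \sigma_{\overline{Y}}^2.
\end{align}$$

The second summand vanishes since $\text{cov}(Y_i, Y_j) = 0$ for $i \neq j$ due to
independence. Consequently, the standard deviation of the sample mean is given
by

$$\sigma_{\overline{Y}} = \frac{\sigma_Y}{\sqrt{n}}.$$

It is worthwhile to mention that these results hold irrespective of the underlying
distribution of the $Y_i$.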
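
Since the results do not depend on normality, they can be illustrated with the dice example. The sketch below is an illustration under assumed settings (n = 25 rolls per sample, 10000 samples); the names rolls, avgs and theory_sd are chosen here. A single roll has mean 3.5 and variance 35/12.

```r
# sample size and number of simulated samples
set.seed(1)
n <- 25
reps <- 10000

# each column holds one sample of n dice rolls
rolls <- replicate(reps, sample(1:6, n, replace = TRUE))

# sample average of every column
avgs <- colMeans(rolls)

# theoretical standard deviation of the sample mean: sigma_Y / sqrt(n)
theory_sd <- sqrt(35 / 12) / sqrt(n)

c(mean = mean(avgs), sd = sd(avgs), theory_sd = theory_sd)
```

The simulated mean of the averages should be close to 3.5 and their standard deviation close to the theoretical value.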


The Sampling Distribution of $\overline{Y}$ when Y Is Normally Distributed


If the $Y_1, \dots, Y_n$ are i.i.d. draws from a normal distribution with mean
$\mu_Y$ and variance $\sigma_Y^2$, the following holds for their sample average
$\overline{Y}$:

$$\overline{Y} \sim N(\mu_Y, \sigma_Y^2 / n) \tag{2.4}$$

For example, if a sample $Y_i$ with i = 1, ..., 10 is drawn from a standard normal
distribution with mean $\mu_Y = 0$ and variance $\sigma_Y^2 = 1$ we have

$$\overline{Y} \sim N(0, 0.1).$$


We can use R's random number generation facilities to verify this result. The
basic idea is to simulate outcomes of the true distribution of $\overline{Y}$ by
repeatedly drawing random samples of 10 observations from the N(0, 1) distribution
and computing their respective averages. If we do this for a large number of
repetitions, the simulated data set of averages should quite accurately reflect the
theoretical distribution of $\overline{Y}$ if the theoretical result holds.


The approach sketched above is an example of what is commonly known as 
Monte Carlo Simulation or Monte Carlo Experiment. To perform this simula- 
tion in R, we proceed as follows: 


1. Choose a sample size n and the number of samples to be drawn, reps.

2. Use the function replicate() in conjunction with rnorm() to draw n
observations from the standard normal distribution reps times.

Note: the outcome of replicate() is a matrix with dimensions n × reps.
It contains the drawn samples as columns.

3. Compute sample means using colMeans(). This function computes the
mean of each column, i.e., of each sample and returns a vector.


# set sample size and number of samples 
n <- 10 
reps <- 10000 


# perform random sampling 
samples <- replicate(reps, rnorm(n)) # 10 x 10000 sample matrix

# compute sample means
sample.avgs <- colMeans(samples)


We then end up with a vector of sample averages. You can check the vector 
property of sample.avgs: 


# check that 'sample.avgs' is a vector
is.vector(sample.avgs) 
#> [1] TRUE 


# print the first few entries to the console 
head(sample.avgs) 
#> [1] -0.1045919 0.2264301 0.5308715 -0.2243476 0.2186909 0.2564663 


A straightforward approach to examine the distribution of univariate numerical 
data is to plot it as a histogram and compare it to some known or assumed 
distribution. By default, hist() will give us a frequency histogram, i.e., a bar 
chart where observations are grouped into ranges, also called bins. The ordinate 
reports the number of observations falling into each of the bins. Instead, we 
want it to report density estimates for comparison purposes. This is achieved 
by setting the argument freq = FALSE. The number of bins is adjusted by the 
argument breaks. 


Using curve(), we overlay the histogram with a red line, the theoretical density
of a N(0, 0.1) random variable. Remember to use the argument add = TRUE to
add the curve to the current plot. Otherwise R will open a new graphic device
and discard the previous plot!¹

¹ Hint: T and F are alternatives for TRUE and FALSE.
# plot the density histogram
hist(sample.avgs,
     ylim = c(0, 1.4),
     col = "steelblue",
     freq = F,
     breaks = 20)

# overlay the theoretical distribution of sample averages on top of the histogram
curve(dnorm(x, sd = 1/sqrt(n)),
      col = "red",
      lwd = 2,
      add = T)
The sampling distribution of $\overline{Y}$ is indeed very close to that of a
N(0, 0.1) distribution so the Monte Carlo simulation supports the theoretical claim.
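
Besides the visual comparison, a few summary numbers can back up the claim. The sketch below repeats the simulation so that it runs on its own and compares the variance of the simulated averages with 0.1 as well as the share of averages inside the central 95% interval of a N(0, 0.1) distribution; the names sim_var, bound and coverage are illustrative.

```r
# repeat the Monte Carlo experiment from above
set.seed(1)
n <- 10
reps <- 10000
sample.avgs <- colMeans(replicate(reps, rnorm(n)))

# variance of the simulated averages; theory says 1 / n = 0.1
sim_var <- var(sample.avgs)
sim_var

# upper bound of the central 95% interval of N(0, 0.1)
bound <- qnorm(0.975, sd = 1 / sqrt(n))

# share of simulated averages inside [-bound, bound]; theory says 0.95
coverage <- mean(abs(sample.avgs) <= bound)
coverage
```

Both numbers should be close to their theoretical counterparts, up to simulation noise.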


Let us discuss another example where using simple random sampling in a sim- 
ulation setup helps to verify a well known result. As discussed before, the Chi- 
squared distribution with M degrees of freedom arises as the distribution of the 
sum of M independent squared standard normal distributed random variables. 


To visualize the claim stated in equation (2.3), we proceed similarly as in the 
example before: 


1. Choose the degrees of freedom, DF, and the number of samples to be drawn 
reps. 

2. Draw reps random samples of size DF from the standard normal distribu- 
tion using replicate(). 

3. For each sample, square the outcomes and sum them up column-wise. 
Store the results. 
Again, we produce a density estimate for the distribution underlying our simulated
data using a density histogram and overlay it with a line graph of the
theoretical density function of the $\chi^2_3$ distribution.


# number of repetitions 
reps <- 10000 


# set degrees of freedom of a chi-Square Distribution
DF <- 3


# sample 10000 column vectors of 3 N(0,1) R.V.s
Z <- replicate(reps, rnorm(DF))


# column sums of squares
X <- colSums(Z^2)


# histogram of column sums of squares 


hist(X, 
freq = F, 
col = "steelblue", 
breaks = 40, 
ylab = "Density", 
main = "") 


# add theoretical density 
curve(dchisq(x, df = DF), 


type = 'l', 
lwd = 2, 
col = "red", 
add = T) 
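
A numerical check can complement the visual comparison. The following sketch is
not part of the original example; the seed and the chosen quantile levels are
arbitrary assumptions. It compares the mean, the variance and some quantiles of
the simulated sums of squares with the theoretical values of a chi-squared
distribution with DF degrees of freedom (mean DF, variance 2 * DF).

```r
# assumption: seed and quantile levels are chosen for illustration only
set.seed(1)

DF <- 3
reps <- 10000

# DF x reps matrix of standard normal draws, one sample per column
Z <- replicate(reps, rnorm(DF))
X <- colSums(Z^2)

# simulated moments; theory says mean = DF and variance = 2 * DF
sim_moments <- c(mean = mean(X), variance = var(X))
sim_moments

# simulated versus theoretical quantiles
probs <- c(0.25, 0.5, 0.75, 0.95)
rbind(simulated   = quantile(X, probs),
      theoretical = qchisq(probs, df = DF))
```

If the simulation works as intended, both rows of the quantile table should be
close to each other.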







Large Sample Approximations to Sampling Distributions 


Sampling distributions as considered in the last section play an important role 
in the development of econometric methods. There are two main approaches 
in characterizing sampling distributions: an “exact” approach and an “approx- 
imate” approach. 


The exact approach aims to find a general formula for the sampling distribution
that holds for any sample size n. We call this the exact distribution or
finite-sample distribution. In the previous examples of dice rolling and normal variates,
we have dealt with functions of random variables whose sampling distributions are
exactly known in the sense that we can write them down as analytic expressions.
However, this is not always possible. For $\overline{Y}$, result (2.4) tells us that
normality of the $Y_i$ implies normality of $\overline{Y}$ (we demonstrated this for the
special case of $Y_i \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$ with n = 10 using a
simulation study that involves simple random sampling). Unfortunately, the exact
distribution of $\overline{Y}$ is generally unknown and often hard to derive (or even
intractable) if we drop the assumption that the $Y_i$ have a normal distribution.


Therefore, as can be guessed from its name, the “approximate” approach aims
to find an approximation to the sampling distribution that works well when
the sample size n is large. A distribution that is used as a large-sample
approximation to the sampling distribution is also called the asymptotic distribution.
This is due to the fact that the asymptotic distribution is the sampling
distribution for $n \rightarrow \infty$, i.e., the approximation becomes exact if the
sample size goes to infinity. However, the difference between the sampling
distribution and the asymptotic distribution is often small even for moderate
sample sizes, so that approximations using the asymptotic distribution are useful.
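
To get a feeling for how quickly such an approximation improves, it helps to
look at a case where the exact distribution happens to be known. The following
sketch rests on assumptions that do not come from the text: the $Y_i$ are taken
to be i.i.d. exponential with rate 1 (mean 1, variance 1), so that their sum
follows a Gamma distribution with shape n and rate 1, and the helper name
cdf_gap is hypothetical. It computes the largest difference between the exact
distribution function of the standardized sample mean and the standard normal
distribution function on a grid.

```r
# largest gap between the exact and the asymptotic CDF of the standardized mean
# assumption: Y_i ~ Exp(1), hence sum(Y_i) ~ Gamma(shape = n, rate = 1)
cdf_gap <- function(n, z = seq(-4, 4, by = 0.01)) {
  # P(sqrt(n) * (Ybar - 1) <= z) = P(sum(Y_i) <= n + z * sqrt(n))
  exact <- pgamma(n + z * sqrt(n), shape = n, rate = 1)
  max(abs(exact - pnorm(z)))
}

n_values <- c(5, 25, 100, 1000)
gaps <- sapply(n_values, cdf_gap)

data.frame(n = n_values, max_gap = round(gaps, 4))
```

The gap shrinks as n grows, but for a skewed distribution like this one it is
still visible for small n, which is why the quality of the approximation should
be judged case by case.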


In this section we will discuss two well-known results that are used to
approximate sampling distributions and thus constitute key tools in econometric theory:
the law of large numbers and the central limit theorem. The law of large
numbers states that in large samples, $\overline{Y}$ is close to $\mu_Y$ with high
probability. The central limit theorem says that the sampling distribution of the
standardized sample average, that is, $(\overline{Y} - \mu_Y)/\sigma_{\overline{Y}}$, is
asymptotically normally distributed. It is particularly interesting that both results
do not depend on the distribution of Y. In other words, while we are unable to
describe the complicated sampling distribution of $\overline{Y}$ if Y is not normal,
approximating it by means of the central limit theorem simplifies the development
and applicability of econometric procedures enormously. This is a key component
underlying the theory of statistical inference for regression models. Both results
are summarized in Key Concept 2.6 and Key Concept 2.7.
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Key Concept 2.6 
Convergence in Probability, Consistency and the Law of Large 
Numbers 


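The sample average $\overline{Y}$ converges in probability to $\mu_Y$: $\overline{Y}$ is
consistent for $\mu_Y$ if the probability that $\overline{Y}$ is in the range
$(\mu_Y - \epsilon)$ to $(\mu_Y + \epsilon)$ becomes arbitrarily close to 1 as n
increases for any constant $\epsilon > 0$. We write this as

$$ P(\mu_Y - \epsilon \leq \overline{Y} \leq \mu_Y + \epsilon) \rightarrow 1, \quad \epsilon > 0 \quad \text{as} \quad n \rightarrow \infty. $$

Consider the independently and identically distributed random variables
$Y_i$, i = 1, ..., n with expectation $E(Y_i) = \mu_Y$ and variance $Var(Y_i) = \sigma^2_Y$.
Under the condition that $\sigma^2_Y < \infty$, that is, large outliers are unlikely,
the law of large numbers states

$$ \overline{Y} \xrightarrow{p} \mu_Y. $$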


The following application simulates a large number of coin tosses (you 
may set the number of trials using the slider) with a fair coin and 
computes the fraction of heads observed for each additional toss. The 
result is a random path that, as stated by the law of large numbers, 
shows a tendency to approach the value of 0.5 as n grows. 


This interactive application is only available in the HTML version. 





The core statement of the law of large numbers is that under quite general
conditions, the probability of obtaining a sample average $\overline{Y}$ that is close
to $\mu_Y$ is high if we have a large sample size.
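
This statement can be checked by simulation. The sketch below is an illustration
under assumptions of our own: fair coin tosses as the underlying distribution, a
tolerance eps = 0.05, 10000 repetitions and an arbitrary seed. For each sample
size it estimates the probability that the sample average lies within eps of
$\mu_Y = 0.5$.

```r
# assumptions: fair coin, tolerance eps and number of repetitions chosen freely
set.seed(1)

eps <- 0.05
reps <- 10000
n_values <- c(10, 100, 1000, 10000)

# share of repetitions in which the sample mean of n tosses is within eps of 0.5
coverage <- sapply(n_values, function(n) {
  ybar <- rbinom(reps, size = n, prob = 0.5) / n  # sample means of n tosses
  mean(abs(ybar - 0.5) <= eps)
})

data.frame(n = n_values, share_within_eps = coverage)
```

The estimated probabilities should increase towards 1 as n grows, in line with
the definition of convergence in probability.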


Consider the example of repeatedly tossing a coin where $Y_i$ is the result of the
$i^{th}$ coin toss. $Y_i$ is a Bernoulli distributed random variable with p the
probability of observing head,

$$ P(Y_i = 1) = p, \quad P(Y_i = 0) = 1 - p, $$

where p = 0.5 as we assume a fair coin. It is straightforward to show that

$$ \mu_Y = p = 0.5. $$

Let $R_n$ denote the proportion of heads in the first n tosses,

$$ R_n = \frac{1}{n} \sum_{i=1}^n Y_i. \tag{2.5} $$

According to the law of large numbers, the observed proportion of heads
converges in probability to $\mu_Y = 0.5$, the probability of tossing head in a single
coin toss,

$$ R_n \xrightarrow{p} \mu_Y = 0.5 \quad \text{as} \quad n \rightarrow \infty. $$


This result is illustrated by the interactive application in Key Concept 2.6. We 
now show how to replicate this using R. 


The procedure is as follows: 


1. Sample N observations from the Bernoulli distribution, e.g., using 
sample(). 

2. Calculate the proportion of heads $R_n$ as in (2.5). A way to achieve this is
to call cumsum() on the vector of observations Y to obtain its cumulative
sum and then divide by the respective number of observations.


We continue by plotting the path and also add a dashed line for the benchmark 
probability p = 0.5. 


# set seed
set.seed(1)

# set number of coin tosses and simulate
N <- 30000
Y <- sample(0:1, N, replace = T)

# Calculate R_n for 1:N
S <- cumsum(Y)
R <- S/(1:N)

# Plot the path.
plot(R,
     ylim = c(0.3, 0.7),
     type = "l",
     col = "steelblue",
     lwd = 2,
     xlab = "n",
     ylab = "R_n",
     main = "Converging Share of Heads in Repeated Coin Tossing")

# Add a dashed line for R_n = 0.5
lines(c(0, N),
      c(0.5, 0.5),
      col = "darkred",
      lty = 2,
      lwd = 1)

Figure: Converging Share of Heads in Repeated Coin Tossing
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There are several things to be said about this plot. 


• The blue graph shows the observed proportion of heads when tossing a
coin n times.


• Since the $Y_i$ are random variables, $R_n$ is a random variate, too. The path
depicted is only one of many possible realizations of $R_n$ as it is determined
by the 30000 observations sampled from the Bernoulli distribution.


• If the number of coin tosses n is small, the proportion of heads may be
anything but close to its theoretical value, $\mu_Y = 0.5$. However, as more
and more observations are included in the sample we find that the path
stabilizes in the neighborhood of 0.5. The average of multiple trials shows
a clear tendency to converge to its expected value as the sample size
increases, just as claimed by the law of large numbers.
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Key Concept 2.7 
The Central Limit Theorem 


Suppose that $Y_1, \dots, Y_n$ are independently and identically distributed
random variables with expectation $E(Y_i) = \mu_Y$ and variance
$Var(Y_i) = \sigma^2_Y$ where $0 < \sigma^2_Y < \infty$. The Central Limit Theorem (CLT)
states that, if the sample size n goes to infinity, the distribution of the
standardized sample average

$$ \frac{\overline{Y} - \mu_Y}{\sigma_{\overline{Y}}} = \frac{\overline{Y} - \mu_Y}{\sigma_Y / \sqrt{n}} $$

becomes arbitrarily well approximated by the standard normal distribution.

The application below demonstrates the CLT for the sample average of
normally distributed random variables with mean 5 and variance $25^2$.
You may check the following properties:

• The distribution of the sample average is normal.

• As the sample size increases, the distribution of $\overline{Y}$ tightens around
the true mean of 5.

• The distribution of the standardized sample average is close to the
standard normal distribution for large n.


This interactive application is only available in the HTML version. 





According to the CLT, the distribution of the sample mean $\overline{Y}$ of the Bernoulli
distributed random variables $Y_i$, i = 1, ..., n, is well approximated by the normal
distribution with parameters $\mu_Y = p = 0.5$ and
$\sigma^2_{\overline{Y}} = p(1 - p)/n = 0.25/n$ for large n. Consequently, for the
standardized sample mean we conclude that

$$ \frac{\overline{Y} - 0.5}{0.5/\sqrt{n}} \tag{2.6} $$

should be well approximated by the standard normal distribution $\mathcal{N}(0, 1)$. We
employ another simulation study to demonstrate this graphically. The idea is
as follows.

Draw a large number of random samples, 10000 say, of size n from the Bernoulli
distribution and compute the sample averages. Standardize the averages as
shown in (2.6). Next, visualize the distribution of the generated standardized
sample averages by means of a histogram and compare to the standard normal
distribution. Repeat this for different sample sizes n to see how increasing the
sample size n impacts the simulated distribution of the averages.
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In R, realize this as follows: 


1. We start by defining that the next four subsequently generated figures shall 
be drawn in a 2 x 2 array such that they can be easily compared. This is 
done by calling par(mfrow = c(2, 2)) before generating the figures. 


2. We define the number of repetitions reps as 10000 and create a vector of 
sample sizes named sample.sizes. We consider samples of sizes 5, 20, 
75, and 100. 


3. Next, we combine two for() loops to simulate the data and plot the
distributions. The inner loop generates 10000 random samples, each consisting
of n observations that are drawn from the Bernoulli distribution, and
computes the standardized averages. The outer loop executes the inner loop
for the different sample sizes n and produces a plot for each iteration.


# subdivide the plot panel into a 2-by-2 array
par(mfrow = c(2, 2))

# set the number of repetitions and the sample sizes
reps <- 10000
sample.sizes <- c(5, 20, 75, 100)

# set seed for reproducibility
set.seed(123)

# outer loop (loop over the sample sizes)
for (n in sample.sizes) {

  samplemean <- rep(0, reps)    # initialize the vector of sample means
  stdsamplemean <- rep(0, reps) # initialize the vector of standardized sample means

  # inner loop (loop over repetitions)
  for (i in 1:reps) {
    x <- rbinom(n, 1, 0.5)
    samplemean[i] <- mean(x)
    stdsamplemean[i] <- sqrt(n)*(mean(x) - 0.5)/0.5
  }

  # plot histogram and overlay the N(0,1) density in every iteration
  hist(stdsamplemean,
       col = "steelblue",
       freq = FALSE,
       breaks = 40,
       xlim = c(-3, 3),
       ylim = c(0, 0.8),
       xlab = paste("n =", n),
       main = "")

  curve(dnorm(x),
        lwd = 2,
        col = "darkred",
        add = TRUE)
}
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We see that the simulated sampling distribution of the standardized average 
tends to deviate strongly from the standard normal distribution if the sample 
size is small, e.g., for n = 5 and n = 20. However, as n grows, the histograms
approach the standard normal distribution. The approximation works quite
well for n = 100.
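
The two nested loops can also be replaced by vectorized draws. The following
sketch is an alternative formulation, not the approach used above: it relies on
the fact that the number of heads in n Bernoulli trials is binomially
distributed, so rbinom(reps, size = n, prob = 0.5) / n yields reps sample
averages at once. The object name std_means is an assumption made for this
example.

```r
set.seed(123)

reps <- 10000
sample.sizes <- c(5, 20, 75, 100)

# one vector of standardized sample averages per sample size
std_means <- lapply(sample.sizes, function(n) {
  ybar <- rbinom(reps, size = n, prob = 0.5) / n  # reps sample averages
  sqrt(n) * (ybar - 0.5) / 0.5                    # standardize as in (2.6)
})
names(std_means) <- paste("n =", sample.sizes)

# mean, standard deviation and number of distinct values for each sample size
sapply(std_means, function(z) {
  c(mean = mean(z), sd = sd(z), distinct = length(unique(z)))
})
```

Means near 0 and standard deviations near 1 are expected for every n. The small
number of distinct values for n = 5 shows why the histogram looks so different
from a smooth normal density: the standardized average can only take six values.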


2.3 Exercises 


This interactive part of the book is only available in the HTML version. 
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Chapter 3 


A Review of Statistics using R


This section reviews important statistical concepts: 


• Estimation of unknown population parameters

• Hypothesis testing

• Confidence intervals

These methods are heavily used in econometrics. We will discuss them in the
simple context of inference about an unknown population mean and discuss
several applications in R. These R applications rely on the following packages
which are not part of the base version of R:

• readxl - allows you to import data from Excel to R.

• dplyr - provides a flexible grammar for data manipulation.

• MASS - a collection of functions for applied statistics.

Make sure these are installed before you go ahead and try to replicate the
examples. The safest way to do so is by checking whether the following code
chunk executes without any errors.


library(dplyr)
library(MASS)
library(readxl)
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3.1 Estimation of the Population Mean 


Key Concept 3.1 
Estimators and Estimates 


Estimators are functions of sample data that are drawn randomly from
an unknown population. Estimates are numeric values computed by
estimators based on the sample data. Estimators are random variables
because they are functions of random data. Estimates are nonrandom
numbers.





Think of some economic variable, for example hourly earnings of college
graduates, denoted by Y. Suppose we are interested in $\mu_Y$, the mean of Y. In order to
exactly calculate $\mu_Y$ we would have to interview every working graduate in the
economy. We simply cannot do this due to time and cost constraints. However,
we can draw a random sample of n i.i.d. observations $Y_1, \dots, Y_n$ and estimate
$\mu_Y$ using one of the simplest estimators in the sense of Key Concept 3.1 one can
think of, that is,

$$ \overline{Y} = \frac{1}{n} \sum_{i=1}^n Y_i, $$

the sample mean of Y. Then again, we could use an even simpler estimator for
$\mu_Y$: the very first observation in the sample, $Y_1$. Is $Y_1$ a good estimator? For
now, assume that

$$ Y \sim \chi^2_{12}, $$

which is not too unreasonable as hourly income is non-negative and we expect
many hourly earnings to be in a range of 5€ to 15€. Moreover, it is common
for income distributions to be skewed to the right, a property of the $\chi^2_{12}$
distribution.


# plot the chi_12^2 distribution
curve(dchisq(x, df=12),
      from = 0,
      to = 40,
      ylab = "density",
      xlab = "hourly earnings in Euro")


We now draw a sample of n = 100 observations and take the first observation
$Y_1$ as an estimate for $\mu_Y$.


# set seed for reproducibility
set.seed(1)

# sample from the chi_12^2 distribution, use only the first observation
rsamp <- rchisq(n = 100, df = 12)
rsamp[1]
#> [1] 8.257893


The estimate 8.26 is not too far away from $\mu_Y = 12$ but it is somewhat intuitive
that we could do better: the estimator $Y_1$ discards a lot of information and its
variance is the population variance:

$$ Var(Y_1) = Var(Y) = 2 \cdot 12 = 24 $$

This brings us to the following question: What is a good estimator of an unknown
parameter in the first place? This question is tackled in Key Concepts 3.2 and
3.3.
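
Before turning to these concepts, the claim about the variance can be made
tangible. The sketch below is an illustration with assumed settings (10000
repetitions, an arbitrary seed, hypothetical object names): it repeatedly draws
samples of size 100 from the $\chi^2_{12}$ distribution and compares the variance
of the first observation with the variance of the sample mean, whose theoretical
values are 24 and 24/100.

```r
# assumptions: number of repetitions and seed are chosen for illustration
set.seed(1)

reps <- 10000
n <- 100

# one sample of size n per row
samples <- matrix(rchisq(reps * n, df = 12), nrow = reps)

first_obs   <- samples[, 1]      # the estimator Y_1
sample_mean <- rowMeans(samples) # the estimator Ybar

# both estimators are centered near 12 ...
c(mean_first = mean(first_obs), mean_ybar = mean(sample_mean))

# ... but their variances differ by roughly a factor of n
c(var_first = var(first_obs), var_ybar = var(sample_mean))
```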
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Key Concept 3.2 
Bias, Consistency and Efficiency 


Desirable characteristics of an estimator include unbiasedness,
consistency and efficiency.

Unbiasedness:
If the mean of the sampling distribution of some estimator $\hat{\mu}_Y$ for the
population mean $\mu_Y$ equals $\mu_Y$,

$$ E(\hat{\mu}_Y) = \mu_Y, $$

the estimator is unbiased for $\mu_Y$. The bias of $\hat{\mu}_Y$ then is 0:

$$ E(\hat{\mu}_Y) - \mu_Y = 0 $$

Consistency:
We want the uncertainty of the estimator $\hat{\mu}_Y$ to decrease as the number
of observations in the sample grows. More precisely, we want the
probability that the estimate $\hat{\mu}_Y$ falls within a small interval around the true
value $\mu_Y$ to get increasingly closer to 1 as n grows. We write this as

$$ \hat{\mu}_Y \xrightarrow{p} \mu_Y. $$

Variance and efficiency:
We want the estimator to be efficient. Suppose we have two estimators,
$\hat{\mu}_Y$ and $\tilde{\mu}_Y$, and for some given sample size n it holds that

$$ E(\hat{\mu}_Y) = E(\tilde{\mu}_Y) = \mu_Y $$

but

$$ Var(\hat{\mu}_Y) < Var(\tilde{\mu}_Y). $$

We then prefer to use $\hat{\mu}_Y$ as it has a lower variance than $\tilde{\mu}_Y$,
meaning that $\hat{\mu}_Y$ is more efficient in using the information provided by the
observations in the sample.
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3.2 Properties of the Sample Mean 





A more precise way to express consistency of an estimator $\hat{\mu}$ for a
parameter $\mu$ is

$$ P(|\hat{\mu} - \mu| < \epsilon) \xrightarrow[n \rightarrow \infty]{} 1 \quad \text{for any} \quad \epsilon > 0. $$

This expression says that the probability of observing a deviation from
the true value $\mu$ that is smaller than some arbitrary $\epsilon > 0$ converges to
1 as n grows. Consistency does not require unbiasedness.
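
The last remark can be illustrated with a deliberately biased estimator. The
following sketch is a constructed example, not taken from the text: for
$Y_i \sim \mathcal{N}(10, 1)$ it studies the estimator $\overline{Y} + 1/n$, which
has bias 1/n for every finite n but is still consistent. The tolerance eps, the
sample sizes, the number of repetitions and the seed are assumptions.

```r
# constructed example: the estimator Ybar + 1/n is biased but consistent
set.seed(1)

reps <- 2000
eps <- 0.1
n_values <- c(5, 50, 500, 2000)

results <- t(sapply(n_values, function(n) {
  samples <- matrix(rnorm(reps * n, mean = 10, sd = 1), nrow = reps)
  est <- rowMeans(samples) + 1 / n          # biased estimator of mu = 10
  c(n = n,
    bias = mean(est) - 10,                  # simulated bias, theory: 1/n
    within_eps = mean(abs(est - 10) < eps)) # share of estimates within eps of mu
}))

results
```

The simulated bias should be close to 1/n and vanish as n grows, while the share
of estimates within eps of the true value approaches 1.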











To examine properties of the sample mean as an estimator for the corresponding 
population mean, consider the following R example. 


We generate a population pop consisting of observations $Y_i$, i = 1, ..., 10000
that originate from a normal distribution with mean $\mu = 10$ and variance
$\sigma^2 = 1$.


To investigate the behavior of the estimator $\hat{\mu} = \overline{Y}$ we can draw random
samples from this population and calculate $\overline{Y}$ for each of them. This is easily
done by making use of the function replicate(). Its argument expr is evaluated n
times, where n here is the argument of replicate() that sets the number of
replications, not the sample size. In this case we draw samples of sizes 5 and 25,
compute the sample means and repeat this exactly N = 25000 times.


For comparison purposes we store results for the estimator $Y_1$, the first
observation in a sample of size 5, separately.


# generate a fictitious population
pop <- rnorm(10000, 10, 1)

# sample from the population and estimate the mean
est1 <- replicate(expr = mean(sample(x = pop, size = 5)), n = 25000)

est2 <- replicate(expr = mean(sample(x = pop, size = 25)), n = 25000)

fo <- replicate(expr = sample(x = pop, size = 5)[1], n = 25000)


Check that est1 and est2 are vectors of length 25000: 


# check if object type is vector
is.vector(est1)
#> [1] TRUE
is.vector(est2)
#> [1] TRUE

# check length
length(est1)
#> [1] 25000
length(est2)
#> [1] 25000


The code chunk below produces a plot of the sampling distributions of the
estimators $\overline{Y}$ and $Y_1$ on the basis of the 25000 samples in each case. We also
plot the density function of the $\mathcal{N}(10,1)$ distribution.


# plot density estimate Y_1
plot(density(fo),
     col = "green",
     lwd = 2,
     ylim = c(0, 2),
     xlab = "estimates",
     main = "Sampling Distributions of Unbiased Estimators")

# add density estimate for the distribution of the sample mean with n=5 to the plot
lines(density(est1),
      col = "steelblue",
      lwd = 2)

# add density estimate for the distribution of the sample mean with n=25 to the plot
lines(density(est2),
      col = "red2",
      lwd = 2)

# add a vertical line at the true parameter
abline(v = 10, lty = 2)

# add N(10,1) density to the plot
curve(dnorm(x, mean = 10),
      lwd = 2,
      lty = 2,
      add = T)

# add a legend
legend("topleft",
       legend = c("N(10,1)",
                  expression(Y[1]),
                  expression(bar(Y) ~ n == 5),
                  expression(bar(Y) ~ n == 25)
                  ),
       lty = c(2, 1, 1, 1),
       col = c("black", "green", "steelblue", "red2"),
       lwd = 2)

Figure: Sampling Distributions of Unbiased Estimators
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First, all sampling distributions (represented by the solid lines) are centered 
around $\mu = 10$. This is evidence for the unbiasedness of $Y_1$, $\overline{Y}_5$ and
$\overline{Y}_{25}$. Of course, the theoretical density $\mathcal{N}(10,1)$ is centered at
10, too.


Next, have a look at the spread of the sampling distributions. Several things
are noteworthy:


• The sampling distribution of $Y_1$ (green curve) tracks the density of the
$\mathcal{N}(10,1)$ distribution (black dashed line) pretty closely. In fact, the
sampling distribution of $Y_1$ is the $\mathcal{N}(10,1)$ distribution. This is less
surprising if you keep in mind that the $Y_1$ estimator does nothing but report an
observation that is randomly selected from a population with $\mathcal{N}(10,1)$
distribution. Hence, $Y_1 \sim \mathcal{N}(10,1)$. Note that this result does not depend
on the sample size n: the sampling distribution of $Y_1$ is always the
population distribution, no matter how large the sample is. $Y_1$ is a reasonable
estimator of $\mu_Y$, but we can do better.


• Both sampling distributions of $\overline{Y}$ show less dispersion than the sampling
distribution of $Y_1$. This means that $\overline{Y}$ has a lower variance than $Y_1$. In
view of Key Concepts 3.2 and 3.3, we find that $\overline{Y}$ is a more efficient
estimator than $Y_1$. In fact, this holds for all n > 1.


• $\overline{Y}$ shows a behavior illustrating consistency (see Key Concept 3.2). The
blue and the red densities are much more concentrated around $\mu = 10$ than
the green one. As the number of observations is increased from 1 to 5,
the sampling distribution tightens around the true parameter. Increasing
the sample size to 25, this effect becomes more apparent. This implies
that the probability of obtaining estimates that are close to the true value
increases with n. This is also reflected by the estimated values of the
density function close to 10: the larger the sample size, the larger the
value of the density.
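
The visual impression can be backed up with numbers. The sketch below repeats
the simulation in a self-contained way (the seed is an assumption, so the
figures will differ slightly from those behind the plot) and compares the
simulated variances of the three estimators with the values var(pop)/1,
var(pop)/5 and var(pop)/25 suggested by theory.

```r
# self-contained repetition of the simulation; the seed is an assumption
set.seed(1)

pop <- rnorm(10000, 10, 1)

fo   <- replicate(n = 25000, expr = sample(x = pop, size = 5)[1])
est1 <- replicate(n = 25000, expr = mean(sample(x = pop, size = 5)))
est2 <- replicate(n = 25000, expr = mean(sample(x = pop, size = 25)))

# simulated variances next to the variances suggested by theory
rbind(simulated = c(Y1 = var(fo), Ybar5 = var(est1), Ybar25 = var(est2)),
      theory    = var(pop) / c(1, 5, 25))
```

The variance shrinks roughly in proportion to the sample size, which is the
numerical counterpart of the densities tightening around 10.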


We encourage you to go ahead and modify the code. Try out different values
for the sample size and see how the sampling distribution of $\overline{Y}$ changes!


$\overline{Y}$ is the Least Squares Estimator of $\mu_Y$


Assume you have some observations $Y_1, \dots, Y_n$ on $Y \sim \mathcal{N}(10,1)$ (which
is unknown) and would like to find an estimator m that predicts the observations
as well as possible. By good we mean to choose m such that the total squared
deviation between the predicted value and the observed values is small.
Mathematically, this means we want to find an m that minimizes

$$ \sum_{i=1}^n (Y_i - m)^2. \tag{3.1} $$

Think of $Y_i - m$ as the mistake made when predicting $Y_i$ by m. We could
also minimize the sum of absolute deviations from m but minimizing the sum
of squared deviations is mathematically more convenient (and leads to a
different result, the sample median being the minimizer of the sum of absolute
deviations). That is why the estimator we are looking for is called the least
squares estimator. $m = \overline{Y}$, the sample mean, is this estimator.
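
The claim can also be verified analytically. The short derivation below is a
standard calculus argument added for illustration: setting the first derivative
of (3.1) with respect to m to zero yields the sample mean, and the positive
second derivative confirms that this is a minimum.

```latex
\begin{align*}
\frac{d}{dm} \sum_{i=1}^n (Y_i - m)^2
  &= -2 \sum_{i=1}^n (Y_i - m) = 0
  \;\Longrightarrow\; \sum_{i=1}^n Y_i = n \, m
  \;\Longrightarrow\; m = \frac{1}{n} \sum_{i=1}^n Y_i = \overline{Y}, \\
\frac{d^2}{dm^2} \sum_{i=1}^n (Y_i - m)^2
  &= 2n > 0 .
\end{align*}
```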


We can show this by generating a random sample and plotting (3.1) as a function 
of m. 


# define the function and vectorize it
sqm <- function(m) {
  sum((y-m)^2)
}

sqm <- Vectorize(sqm)

# draw random sample and compute the mean
y <- rnorm(100, 10, 1)
mean(y)
#> [1] 10.1364

# plot the objective function
curve(sqm(x),
      from = -50,
      to = 70,
      xlab = "m",
      ylab = "sqm(m)")

# add vertical line at mean(y)
abline(v = mean(y),
       lty = 2,
       col = "darkred")

# add annotation at mean(y)
# (the vertical position is missing in the source; y = 0 is an assumption)
text(x = mean(y),
     y = 0,
     labels = paste(round(mean(y), 2)))
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Notice that (3.1) is a quadratic function so that there is only one minimum. The 
plot shows that this minimum lies exactly at the sample mean of the sample
data.
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Some R functions can only interact with functions that take a vector as 
an input and evaluate the function body on every entry of the vector, 
for example curve(). We call such functions vectorized functions and 
it is often a good idea to write vectorized functions yourself, although 
this is cumbersome in some cases. Having a vectorized function in R is 
never a drawback since these functions work on both single values and 
vectors. 
Let us look at the function sqm(), which is non-vectorized:

sqm <- function(m) {
  sum((y-m)^2) #body of the function
}

Providing, e.g., c(1,2,3) as the argument m does not give the desired result:
in the operation y-m, R recycles m along y (with a warning if the lengths of y
and m are incompatible) and sum() then collapses everything into a single
number instead of returning one value per entry of m. This is why we cannot
use sqm() in conjunction with curve().

Here Vectorize() comes into play. It generates a vectorized version of a
non-vectorized function.
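
The difference is easy to see in a small experiment. The following sketch uses
its own sample y and an arbitrary seed (both assumptions); the object names
res_plain, sqm_vec and res_vec are hypothetical.

```r
set.seed(1)
y <- rnorm(100, 10, 1)

# non-vectorized version
sqm <- function(m) {
  sum((y - m)^2)
}

# a vector input is collapsed into one number; R additionally warns because
# length(y) = 100 is not a multiple of length(m) = 3 (suppressed here)
res_plain <- suppressWarnings(sqm(c(1, 2, 3)))
length(res_plain)

# the vectorized version returns one value per entry of m
sqm_vec <- Vectorize(sqm)
res_vec <- sqm_vec(c(1, 2, 3))
length(res_vec)

# Vectorize() gives the same result as applying sqm() to each entry separately
all.equal(res_vec, sapply(c(1, 2, 3), sqm))
```

curve() evaluates its expression on a whole vector of x values at once and
expects a result of the same length, which only the vectorized version delivers.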











Why Random Sampling is Important 


So far, we assumed (sometimes implicitly) that the observed data $Y_1, \dots, Y_n$ are
the result of a sampling process that satisfies the assumption of simple random
sampling. This assumption often is fulfilled when estimating a population mean
using $\overline{Y}$. If this is not the case, estimates may be biased.


Let us fall back to pop, the fictitious population of 10000 observations, and
compute the population mean $\mu_{pop}$:


# compute the population mean of pop 
mean (pop) 
#> [1] 9.992604 


Next we sample 10 observations from pop with sample() and estimate $\mu_{pop}$
using $\overline{Y}$ repeatedly. However, now we use a sampling scheme that deviates
from simple random sampling: instead of ensuring that each member of the
population has the same chance to end up in a sample, we assign a higher
probability of being sampled to the 2500 smallest observations of the population
by setting the argument prob to a suitable vector of probability weights:


# simulate outcomes for the sample mean when the i.i.d. assumption fails
est3 <- replicate(n = 25000,
                  expr = mean(sample(x = sort(pop),
                                     size = 10,
                                     prob = c(rep(4, 2500), rep(1, 7500)))))

# compute the sample mean of the outcomes
mean(est3)
#> [1] 9.247113


Next we plot the sampling distribution of $\overline{Y}$ for this non-i.i.d. case and
compare it to the sampling distribution when the i.i.d. assumption holds.


# sampling distribution of sample mean, i.i.d. holds, n=25
plot(density(est2),
     col = "steelblue",
     lwd = 2,
     xlim = c(8, 11),
     xlab = "Estimates",
     main = "When the i.i.d. Assumption Fails")

# sampling distribution of sample mean, i.i.d. fails, n=10
lines(density(est3),
      col = "red2",
      lwd = 2)

# add a legend
legend("topleft",
       legend = c(expression(bar(Y)[n == 10]~", i.i.d. fails"),
                  expression(bar(Y)[n == 25]~", i.i.d. holds")
                  ),
       lty = c(1, 1),
       col = c("red2", "steelblue"),
       lwd = 2)

Here, the failure of the i.i.d. assumption implies that, on average, we
underestimate $\mu_Y$ using $\overline{Y}$: the corresponding distribution of $\overline{Y}$
is shifted to the left. In other words, $\overline{Y}$ is a biased estimator for $\mu_Y$ if
the i.i.d. assumption does not hold.
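
The direction of the bias can be anticipated without simulating any samples.
The sketch below is a rough approximation added for illustration: it treats each
of the 10 draws as if it were a single weighted draw from the sorted population
(ignoring that sample() draws without replacement) and computes the
corresponding weighted mean. It generates its own population with an arbitrary
seed, so the numbers differ from those reported above.

```r
# assumption: own population and seed, results differ from the figures in the text
set.seed(1)
pop <- rnorm(10000, 10, 1)

# same probability weights as used for est3: the 2500 smallest observations
# are four times as likely to be drawn as the remaining 7500
w <- c(rep(4, 2500), rep(1, 7500))

# approximate expected value of a single draw under the weighted scheme
weighted_mean <- weighted.mean(sort(pop), w)

c(population_mean = mean(pop), weighted_mean = weighted_mean)
```

The weighted mean lies clearly below the population mean, which is the shift to
the left that the sampling distribution exhibits.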


3.3 Hypothesis Tests Concerning the Population Mean


In this section we briefly review concepts in hypothesis testing and discuss how 
to conduct hypothesis tests in R. We focus on drawing inferences about an 
unknown population mean. 


About Hypotheses and Hypothesis Testing 


In a significance test we want to exploit the information contained in a sample
as evidence in favor of or against a hypothesis. Essentially, hypotheses are simple
questions that can be answered by ‘yes’ or ‘no’. In a hypothesis test we typically
deal with two different hypotheses:


• The null hypothesis, denoted $H_0$, is the hypothesis we are interested in
testing.


• There must be an alternative hypothesis, denoted $H_1$, the hypothesis that
is thought to hold if the null hypothesis is rejected.


The null hypothesis that the population mean of Y equals the value $\mu_{Y,0}$ is
written as

$$ H_0: E(Y) = \mu_{Y,0}. $$

Often the alternative hypothesis chosen is the most general one,

$$ H_1: E(Y) \neq \mu_{Y,0}, $$

meaning that E(Y) may be anything but the value under the null hypothesis.
This is called a two-sided alternative.


For the sake of brevity, we only consider two-sided alternatives in the subsequent
sections of this chapter.
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The p-Value 


Assume that the null hypothesis is true. The p-value is the probability of
drawing data and observing a corresponding test statistic that is at least as adverse to
what is stated under the null hypothesis as the test statistic actually computed
using the sample data.


In the context of the population mean and the sample mean, this definition can
be stated mathematically in the following way:

$$ p\text{-value} = P_{H_0}\left[ |\overline{Y} - \mu_{Y,0}| > |\overline{Y}^{act} - \mu_{Y,0}| \right] \tag{3.2} $$

In (3.2), $\overline{Y}^{act}$ is the sample mean for the data at hand (a value). In order to
compute the p-value as in (3.2), knowledge about the sampling distribution of
$\overline{Y}$ (a random variable) when the null hypothesis is true (the null distribution)
is required. However, in most cases the sampling distribution and thus the null
distribution of $\overline{Y}$ are unknown. Fortunately, the CLT (see Key Concept 2.7)
allows for the large-sample approximation

$$ \overline{Y} \approx \mathcal{N}(\mu_{Y,0}, \, \sigma^2_{\overline{Y}}), \quad \sigma^2_{\overline{Y}} = \frac{\sigma_Y^2}{n}, $$

assuming the null hypothesis $H_0: E(Y) = \mu_{Y,0}$ is true. With some algebra it
follows for large n that

$$ \frac{\overline{Y} - \mu_{Y,0}}{\sigma_Y / \sqrt{n}} \sim \mathcal{N}(0,1). $$

So in large samples, the p-value can be computed without knowledge of the exact
sampling distribution of $\overline{Y}$ using the above normal approximation.


Calculating the p-Value when the Standard Deviation is 
Known 


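For now, let us assume that $\sigma_{\overline{Y}}$ is known. Then, we can rewrite (3.2) as

$$ p\text{-value} = P_{H_0}\left[ \left| \frac{\overline{Y} - \mu_{Y,0}}{\sigma_{\overline{Y}}} \right| > \left| \frac{\overline{Y}^{act} - \mu_{Y,0}}{\sigma_{\overline{Y}}} \right| \right] \tag{3.3} $$

$$ = 2 \cdot \Phi\left[ - \left| \frac{\overline{Y}^{act} - \mu_{Y,0}}{\sigma_{\overline{Y}}} \right| \right]. \tag{3.4} $$

The p-value is the area in the tails of the $\mathcal{N}(0, 1)$ distribution that lies beyond

$$ \pm \left| \frac{\overline{Y}^{act} - \mu_{Y,0}}{\sigma_{\overline{Y}}} \right|. \tag{3.5} $$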
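
As a numerical illustration of the two-sided p-value under the normal
approximation, consider the following sketch. All input values are hypothetical
and chosen only to keep the arithmetic simple; the object names are assumptions,
too. The p-value is computed as twice the standard normal distribution function
evaluated at minus the absolute standardized deviation, and then checked against
the area in both tails of the standard normal density.

```r
# hypothetical inputs, chosen for illustration only
y_bar_act <- 10.6  # observed sample mean
mu_0      <- 10    # mean under the null hypothesis
sigma_y   <- 3     # known population standard deviation
n         <- 100   # sample size

# standard deviation of the sample mean and standardized deviation
sigma_ybar <- sigma_y / sqrt(n)
z_act <- (y_bar_act - mu_0) / sigma_ybar

# two-sided p-value based on the N(0,1) approximation
p_value <- 2 * pnorm(-abs(z_act))
p_value

# the same quantity as the total area in both tails of the N(0,1) density
tail_area <- 2 * integrate(dnorm, lower = abs(z_act), upper = Inf)$value
all.equal(p_value, tail_area, tolerance = 1e-4)
```

With these inputs the standardized deviation equals 2, so the p-value is a
little below 0.05.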
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We now use R to visualize what is stated in (3.4) and (3.5). The next code 
chunk replicates Figure 3.1 of the book. 


# plot the standard normal density on the interval [-4,4]
curve(dnorm(x),
      xlim = c(-4, 4),
      main = "Calculating a p-Value",
      yaxs = "i",
      xlab = "z",
      ylab = "",
      lwd = 2,
      axes = FALSE)

# add x-axis
axis(1,
     at = c(-1.5, 0, 1.5),
     padj = 0.75,
     labels = c(expression(-frac(bar(Y)^"act"~-~mu[list(Y,0)], sigma[bar(Y)])),
                0,
                expression(frac(bar(Y)^"act"~-~mu[list(Y,0)], sigma[bar(Y)]))))

# shade p-value/2 region in left tail
polygon(x = c(-6, seq(-6, -1.5, 0.01), -1.5),
        y = c(0, dnorm(seq(-6, -1.5, 0.01)), 0),
        col = "steelblue")

# shade p-value/2 region in right tail
polygon(x = c(1.5, seq(1.5, 6, 0.01), 6),
        y = c(0, dnorm(seq(1.5, 6, 0.01)), 0),
        col = "steelblue")

Figure: Calculating a p-Value


Sample Variance, Sample Standard Deviation and Standard Error


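If $\sigma^2_Y$ is unknown, it must be estimated. This can be done using the sample
variance

$$ s_Y^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \overline{Y})^2. \tag{3.6} $$

Furthermore

$$ s_Y = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (Y_i - \overline{Y})^2} \tag{3.7} $$

is a suitable estimator for the standard deviation of Y. In R, $s_Y$ is implemented
in the function sd(), see ?sd.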
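
That sd() uses the divisor n - 1 can be confirmed directly. The sketch below
draws an arbitrary sample (seed, sample size and distribution are assumptions)
and compares a manual implementation of the sample standard deviation with the
result of sd().

```r
# arbitrary example data
set.seed(1)
y <- rnorm(50, mean = 10, sd = 3)
n <- length(y)

# sample standard deviation computed by hand, with divisor n - 1
s_manual <- sqrt(sum((y - mean(y))^2) / (n - 1))

# compare with the built-in function
c(manual = s_manual, builtin = sd(y))
all.equal(s_manual, sd(y))
```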


Using R we can illustrate that $s_Y$ is a consistent estimator for $\sigma_Y$, that is

$$ s_Y \xrightarrow{p} \sigma_Y. $$

The idea here is to generate a large number of samples $Y_1, \dots, Y_n$ where
$Y \sim \mathcal{N}(10, 9)$, say, estimate $\sigma_Y$ using $s_Y$ and investigate how the
distribution of $s_Y$ changes as n gets larger.
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# vector of sample sizes 
n <- c(10000, 5000, 2000, 1000, 500)

# sample observations, estimate using 'sd()' and plot the estimated distributions
sq_y <- replicate(n = 10000, expr = sd(rnorm(n[1], 10, 3)))
plot(density(sq_y),
     main = expression("Sampling Distributions of" ~ s[Y]),
     xlab = expression(s[Y]),
     lwd = 2)

for (i in 2:length(n)) {
  sq_y <- replicate(n = 10000, expr = sd(rnorm(n[i], 10, 3)))
  lines(density(sq_y),
        col = i,
        lwd = 2)
}

# add a legend
legend("topleft",
       legend = c(expression(n == 10000),
                  expression(n == 5000),
                  expression(n == 2000),
                  expression(n == 1000),
                  expression(n == 500)),
       col = 1:5,
       lwd = 2)
Figure: Sampling Distributions of $s_Y$ (density estimates for n = 10000, 5000, 2000, 1000 and 500)


The plot shows that the distribution of $s_Y$ tightens around the true value
$\sigma_Y = 3$ as n increases.


The function that estimates the standard deviation of an estimator is called the
standard error of the estimator. Key Concept 3.4 summarizes the terminology
in the context of the sample mean.


Key Concept 3.4 
The Standard Error of Y 


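Take an i.i.d. sample $Y_1, \dots, Y_n$. The mean of Y is consistently estimated
by $\overline{Y}$, the sample mean of the $Y_i$. Since $\overline{Y}$ is a random variable, it
has a sampling distribution with variance $\sigma^2_Y / n$.

The standard error of $\overline{Y}$, denoted $SE(\overline{Y})$, is an estimator of the
standard deviation of $\overline{Y}$:

$$ SE(\overline{Y}) = \hat{\sigma}_{\overline{Y}} = \frac{s_Y}{\sqrt{n}} $$

The caret (^) over $\sigma$ indicates that $\hat{\sigma}_{\overline{Y}}$ is an estimator for
$\sigma_{\overline{Y}}$.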





As an example to underpin Key Concept 3.4, consider a sample of n = 100 i.i.d.
observations of the Bernoulli distributed variable Y with success probability
p = 0.1. Thus E(Y) = p = 0.1 and Var(Y) = p(1 - p). E(Y) can be estimated
by $\overline{Y}$, which then has variance

$$ \sigma^2_{\overline{Y}} = p(1 - p)/n = 0.0009 $$

and standard deviation

$$ \sigma_{\overline{Y}} = \sqrt{p(1 - p)/n} = 0.03. $$

In this case the standard error of $\overline{Y}$ can be estimated by

$$ SE(\overline{Y}) = \sqrt{\overline{Y}(1 - \overline{Y})/n}. $$

Let us check whether $\overline{Y}$ and $SE(\overline{Y})$ estimate the respective true values,
on average.


# draw 10000 samples of size 100 and estimate the mean of Y and 
# estimate the standard error of the sample mean 


mean_estimates <- numeric(10000) 
se_estimates <- numeric(10000) 


for (i in 1:10000) {

  s <- sample(0:1,
              size = 100,
              prob = c(0.9, 0.1),
              replace = T)

  mean_estimates[i] <- mean(s)
  se_estimates[i] <- sqrt(mean(s) * (1 - mean(s)) / 100)

}

mean(mean_estimates)
#> [1] 0.10047 
mean(se_estimates) 
#> [1] 0.02961587 


Both estimators seem to be unbiased for the true parameters. In fact, this is
true for the sample mean, but not for $SE(\overline{Y})$. However, both estimators are
consistent for the true parameters.
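
The small bias of $SE(\overline{Y})$ can be made visible without simulation. The following sketch is our own illustration (all variable names are assumptions): since the number of successes in a Bernoulli sample is binomially distributed, the expectations of $\overline{Y}$ and $SE(\overline{Y})$ can be computed exactly by weighting every possible outcome with its probability.

```r
# exact expectations of Y-bar and SE(Y-bar) for Bernoulli data
n <- 100
p <- 0.1

k <- 0:n                                     # possible numbers of successes
y_bar <- k / n                               # corresponding sample means
se_values <- sqrt(y_bar * (1 - y_bar) / n)   # standard error for each outcome

# weight each outcome with its binomial probability
expected_mean <- sum(dbinom(k, n, p) * y_bar)
expected_se   <- sum(dbinom(k, n, p) * se_values)
true_sd       <- sqrt(p * (1 - p) / n)

c(expected_mean = expected_mean, expected_se = expected_se, true_sd = true_sd)
```

The expected value of the sample mean coincides with $p$, whereas the expected standard error lies slightly below the true standard deviation of 0.03, which is in line with the simulation result above.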


Calculating the p-value When the Standard Deviation is 
Unknown 


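When $\sigma_Y$ is unknown, the p-value for a hypothesis test concerning $\mu_Y$ using $\overline{Y}$
can be computed by replacing $\sigma_{\overline{Y}}$ in (3.4) by the standard error $SE(\overline{Y}) = \hat{\sigma}_{\overline{Y}}$.
Then,

$$p\text{-value} = 2 \cdot \Phi\left(-\left|\frac{\overline{Y}^{act} - \mu_{Y,0}}{SE(\overline{Y})}\right|\right).$$


This is easily done in R: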


# sample and estimate, compute standard error
samplemean_act <- mean(
  sample(0:1,
         prob = c(0.9, 0.1),
         replace = T,
         size = 100))

SE_samplemean <- sqrt(samplemean_act * (1 - samplemean_act) / 100)

# null hypothesis
mean_h0 <- 0.1

# compute the p-value
pvalue <- 2 * pnorm(- abs(samplemean_act - mean_h0) / SE_samplemean)
pvalue

#> [1] 0.7492705 


Later in the book, we will encounter more convenient approaches to obtain 
t-statistics and p-values using R. 


The t-statistic 


In hypothesis testing, the standardized sample average

$$t = \frac{\overline{Y} - \mu_{Y,0}}{SE(\overline{Y})} \tag{3.8}$$

is called a t-statistic. This t-statistic plays an important role in testing hypotheses
about $\mu_Y$. It is a prominent example of a test statistic.


Implicitly, we already have computed a t-statistic for Y in the previous code 
chunk. 


# compute a t-statistic for the sample mean 

tstatistic <- (samplemean_act - mean_h0) / SE_samplemean 
tstatistic 

#> [1] 0.3196014 


Using R we can illustrate that if $\mu_{Y,0}$ equals the true value, that is, if the null
hypothesis is true, (3.8) is approximately $N(0, 1)$ distributed when $n$ is large.


# prepare empty vector for t-statistics 
tstatistics <- numeric(10000) 


# set sample size 
n <- 300 


# simulate 10000 t-statistics 
for (i in 1:10000) {

  s <- sample(0:1,
              size = n,
              prob = c(0.9, 0.1),
              replace = T)

  tstatistics[i] <- (mean(s) - 0.1) / sqrt(var(s) / n)

}


In the simulation above we estimate the variance of the $Y_i$ using var(s). This
is more general than mean(s)*(1-mean(s)) since the latter requires that the
data are Bernoulli distributed and that we know this.


# plot density and compare to N(0,1) density 
plot(density(tstatistics),
     xlab = "t-statistic",
     main = "Estimated Distribution of the t-statistic when n=300",
     lwd = 2,
     xlim = c(-4, 4),
     col = "steelblue")

# N(0,1) density (dashed)
curve(dnorm(x),
      add = T,
      lty = 2,
      lwd = 2)


[Figure: Estimated Distribution of the t-statistic when n=300. Estimated density of the simulated t-statistics together with the dashed N(0,1) density; x-axis: t-statistic, y-axis: density.]


Judging from the plot, the normal approximation works reasonably well for the 
chosen sample size. This normal approximation has already been used in the 
definition of the p-value, see (3.8).


Hypothesis Testing with a Prespecified Significance Level


Key Concept 3.5 
The Terminology of Hypothesis Testing 


In hypothesis testing, two types of mistakes are possible: 


1. The null hypothesis is rejected although it is true (type-I-error) 


2. The null hypothesis is not rejected although it is false (type-II-error)


The significance level of the test is the probability to commit a type- 
I-error we are willing to accept in advance. E.g., using a prespecified 
significance level of 0.05, we reject the null hypothesis if and only if the 
p-value is less than 0.05. The significance level is chosen before the test 
is conducted. 


An equivalent procedure is to reject the null hypothesis if the observed 
test statistic is, in absolute value terms, larger than the critical value 
of the test statistic. The critical value is determined by the significance 
level chosen and defines two disjoint sets of values which are called 
acceptance region and rejection region. The acceptance region contains 
all values of the test statistic for which the test does not reject while the 
rejection region contains all the values for which the test does reject. 


The p-value is the probability that, in repeated sampling under the
same conditions and given that the null hypothesis is true, a test
statistic is observed that provides at least as much evidence against
the null hypothesis as the test statistic actually observed.


The actual probability that the test rejects the true null hypothesis 
is called the size of the test. In an ideal setting, the size equals the 
significance level. 


The probability that the test correctly rejects a false null hypothesis is 
called power. 
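
Size and power can be approximated by simulation. The following is a minimal sketch under assumptions of our own choosing (normally distributed data, $n = 100$, a true mean of 10 under the null and of 11 under the alternative, and the hypothetical helper t_stat); it is meant to illustrate the two concepts, not to reproduce a result from the book.

```r
# a minimal sketch: simulated size and power of a two-sided test at the 5% level
set.seed(1)

# hypothetical helper: t-statistic for H0: mu = mu_0 based on one sample
t_stat <- function(y, mu_0) (mean(y) - mu_0) / (sd(y) / sqrt(length(y)))

reps <- 5000
n <- 100

# data generated under the null (true mean 10) and under an alternative (true mean 11)
t_null <- replicate(reps, t_stat(rnorm(n, mean = 10, sd = 3), mu_0 = 10))
t_alt  <- replicate(reps, t_stat(rnorm(n, mean = 11, sd = 3), mu_0 = 10))

# share of rejections when H0 is true (size) and when H0 is false (power)
size  <- mean(abs(t_null) > 1.96)
power <- mean(abs(t_alt) > 1.96)
c(size = size, power = power)
```

The simulated size should be close to the significance level of 0.05, while the power depends on how far the true mean is from the hypothesized value.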





Reconsider the pvalue computed further above: 


# check whether p-value < 0.05
pvalue < 0.05
#> [1] FALSE


The condition is not fulfilled so we, correctly, do not reject the null hypothesis.


When working with a t-statistic instead, it is equivalent to apply the following 
rule: 


Reject $H_0$ if $|t^{act}| > 1.96$


We reject the null hypothesis at the significance level of 5% if the computed 
t-statistic lies beyond the critical value of 1.96 in absolute value terms. 1.96 is 
the 0.975-quantile of the standard normal distribution. 


# check the critical value 
qnorm(p = 0.975) 
#> [1] 1.959964 


# check whether the null is rejected using the t-statistic computed further above 
abs(tstatistic) > 1.96 
#> [1] FALSE 


Just like using the p-value, we cannot reject the null hypothesis using the corre- 
sponding t-statistic. Key Concept 3.6 summarizes the procedure of performing 
a two-sided hypothesis test about the population mean E(Y). 


Key Concept 3.6 
Testing the Hypothesis $E(Y) = \mu_{Y,0}$ Against the Alternative
$E(Y) \neq \mu_{Y,0}$


1. Estimate $\mu_Y$ using $\overline{Y}$ and compute $SE(\overline{Y})$, the standard error of
$\overline{Y}$.

2. Compute the t-statistic.

3. Compute the p-value and reject the null hypothesis at the 5% level
of significance if the p-value is smaller than 0.05 or, equivalently, if
$|t^{act}| > 1.96$.
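
The three steps can be collected in a small function. This is a sketch for illustration only: the function name mean_test, its arguments and the simulated data are our own assumptions, and the p-value relies on the normal approximation used in this section.

```r
# hypothetical helper implementing the three steps of Key Concept 3.6
mean_test <- function(y, mu_0, level = 0.05) {
  y_bar <- mean(y)                      # step 1: estimate the mean ...
  se <- sd(y) / sqrt(length(y))         # ... and compute its standard error
  t_act <- (y_bar - mu_0) / se          # step 2: t-statistic
  p_value <- 2 * pnorm(-abs(t_act))     # step 3: p-value (normal approximation)
  list(estimate = y_bar, se = se, t = t_act, p_value = p_value,
       reject = p_value < level)
}

# usage with simulated data for which the null hypothesis is true
set.seed(1)
y <- rnorm(200, mean = 5, sd = 2)
res <- mean_test(y, mu_0 = 5)

# both decision rules lead to the same conclusion
res$reject == (abs(res$t) > qnorm(0.975))
```

Comparing the p-value with the level and comparing the absolute t-statistic with the 0.975-quantile of the standard normal distribution are two ways of stating the same decision.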





One-sided Alternatives 


Sometimes we are interested in testing if the mean is bigger or smaller than some 
value hypothesized under the null. To stick to the book, take the presumed wage 
gap between well and less educated working individuals. Since we anticipate that 
such a differential exists, a relevant alternative (to the null hypothesis that there 
is no wage differential) is that well educated individuals earn more, i.e., that the
average hourly wage for this group, $\mu_Y$, is bigger than $\mu_{Y,0}$, the average wage of
less educated workers which we assume to be known here for simplicity (Section
3.5 discusses how to test the equivalence of two unknown population
means).


This is an example of a right-sided test and the hypotheses pair is chosen to be 


$$H_0: \mu_Y = \mu_{Y,0} \quad \text{vs.} \quad H_1: \mu_Y > \mu_{Y,0}.$$


We reject the null hypothesis if the computed test-statistic is larger than the
critical value 1.64, the 0.95-quantile of the $N(0,1)$ distribution. This ensures
that $1 - 0.95 = 5\%$ probability mass remains in the area to the right of the critical
value. As before, we can visualize this in R using the function polygon().


# plot the standard normal density on the domain [-4,4]
curve(dnorm(x),
      xlim = c(-4, 4),
      main = "Rejection Region of a Right-Sided Test",
      yaxs = "i",
      xlab = "t-statistic",
      ylab = "",
      lwd = 2,
      axes = FALSE)

# add the x-axis
axis(1,
     at = c(-4, 0, 1.64, 4),
     padj = 0.5,
     labels = c("", 0, expression(Phi^-1~(.95)==1.64), ""))

# shade the rejection region in the right tail
polygon(x = c(1.64, seq(1.64, 4, 0.01), 4),
        y = c(0, dnorm(seq(1.64, 4, 0.01)), 0),
        col = "darkred")


[Figure: Rejection Region of a Right-Sided Test. Standard normal density with the area to the right of $\Phi^{-1}(0.95) = 1.64$ shaded; x-axis: t-statistic.]


Analogously, for the left-sided test we have

$$H_0: \mu_Y = \mu_{Y,0} \quad \text{vs.} \quad H_1: \mu_Y < \mu_{Y,0}.$$

The null is rejected if the observed test statistic falls short of the critical value
which, for a test at the 0.05 level of significance, is given by $-1.64$, the 0.05-
quantile of the $N(0,1)$ distribution. 5% probability mass lies to the left of the
critical value.
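
The critical values and the corresponding one-sided p-values can be obtained with qnorm() and pnorm(). In the following sketch the value of the t-statistic, t_act = 1.8, is an assumed number chosen only for illustration.

```r
# critical values of one-sided tests at the 5% level
crit_right <- qnorm(0.95)   # right-sided test: reject H0 if t > crit_right
crit_left  <- qnorm(0.05)   # left-sided test: reject H0 if t < crit_left

# one-sided p-values for an illustrative t-statistic (assumed value)
t_act <- 1.8
p_right <- pnorm(t_act, lower.tail = FALSE)   # P(Z > t_act)
p_left  <- pnorm(t_act)                       # P(Z < t_act)

c(crit_right = crit_right, crit_left = crit_left,
  p_right = p_right, p_left = p_left)
```

Because the standard normal distribution is symmetric, the two critical values differ only in sign, and the two one-sided p-values add up to one.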


It is straightforward to adapt the code chunk above to the case of a left-sided 
test. We only have to adjust the color shading and the tick marks. 


# plot the standard normal density on the domain [-4,4]
curve(dnorm(x),
      xlim = c(-4, 4),
      main = "Rejection Region of a Left-Sided Test",
      yaxs = "i",
      xlab = "t-statistic",
      ylab = "",
      lwd = 2,
      axes = FALSE)

# add x-axis
axis(1,
     at = c(-4, 0, -1.64, 4),
     padj = 0.5,
     labels = c("", 0, expression(Phi^-1~(.05)==-1.64), ""))

# shade rejection region in left tail
polygon(x = c(-4, seq(-4, -1.64, 0.01), -1.64),
        y = c(0, dnorm(seq(-4, -1.64, 0.01)), 0),
        col = "darkred")


[Figure: Rejection Region of a Left-Sided Test. Standard normal density with the area to the left of $\Phi^{-1}(0.05) = -1.64$ shaded; x-axis: t-statistic.]


3.4 Confidence Intervals for the Population 
Mean 


As stressed before, we will never estimate the exact value of the population mean 
of Y using a random sample. However, we can compute confidence intervals 
for the population mean. In general, a confidence interval for an unknown 
parameter is a recipe that, in repeated samples, yields intervals that contain the 
true parameter with a prespecified probability, the confidence level. Confidence 
intervals are computed using the information available in the sample. Since this 
information is the result of a random process, confidence intervals are random 
variables themselves. 


Key Concept 3.7 shows how to compute confidence intervals for the unknown
population mean $E(Y)$.


Key Concept 3.7
Confidence Intervals for the Population Mean


A 95% confidence interval for $\mu_Y$ is a random variable that contains
the true $\mu_Y$ in 95% of all possible random samples. When $n$ is large we
can use the normal approximation. Then, 99%, 95%, 90% confidence
intervals are

99% confidence interval for $\mu_Y = \left[\overline{Y} \pm 2.58 \times SE(\overline{Y})\right]$,
95% confidence interval for $\mu_Y = \left[\overline{Y} \pm 1.96 \times SE(\overline{Y})\right]$,
90% confidence interval for $\mu_Y = \left[\overline{Y} \pm 1.64 \times SE(\overline{Y})\right]$.


These confidence intervals are sets of null hypotheses we cannot reject
in a two-sided hypothesis test at the given level of confidence.


Now consider the following statements. 


1. In repeated sampling, the interval

$$\left[\overline{Y} \pm 1.96 \times SE(\overline{Y})\right]$$

covers the true value of $\mu_Y$ with a probability of 95%.

2. We have computed $\overline{Y} = 5.1$ and $SE(\overline{Y}) = 2.5$ so the interval

$$[5.1 \pm 1.96 \times 2.5] = [0.2, 10]$$

covers the true value of $\mu_Y$ with a probability of 95%.


While 1. is right (this is in line with the definition above), 2. is wrong 
and none of your lecturers wants to read such a sentence in a term paper, 
written exam or similar, believe us. The difference is that, while 1. is the 
definition of a random variable, 2. is one possible outcome of this random 
variable so there is no meaning in making any probabilistic statement 
about it. Either the computed interval does cover uy or it does not! 
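
Statement 1 can be checked by simulation. The following sketch uses assumptions of our own choosing (normally distributed data with true mean 10 and standard deviation 10, $n = 100$ and 5000 repetitions): it computes the interval for many samples and records how often the true mean is covered.

```r
# a minimal sketch: coverage of the 95% confidence interval in repeated samples
set.seed(1)
mu_true <- 10
n <- 100
reps <- 5000

covers <- replicate(reps, {
  y <- rnorm(n, mean = mu_true, sd = 10)
  se <- sd(y) / sqrt(n)
  lower <- mean(y) - 1.96 * se
  upper <- mean(y) + 1.96 * se
  lower <= mu_true && mu_true <= upper   # TRUE if this interval covers mu_true
})

# share of intervals that contain the true mean
mean(covers)
```

Each entry of covers is either TRUE or FALSE, which mirrors the point made above: a single computed interval either covers the true mean or it does not, and the 95% refers to the share of such intervals over many samples.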





In R, testing of hypotheses about the mean of a population on the basis of a
random sample is very easy due to functions like t.test() from the stats
package. It produces an object of type list. Luckily, one of the simplest
ways to use t.test() is when you want to obtain a 95% confidence interval
for some population mean. We start by generating some random data and
calling t.test() in conjunction with ls() to obtain a breakdown of the output
components.


# set seed 
set.seed(1) 


# generate some sample data 
sampledata <- rnorm(100, 10, 10) 


# check the type of the outcome produced by t.test
typeof(t.test(sampledata))
#> [1] "list"

# display the list elements produced by t.test
ls(t.test(sampledata))
#>  [1] "alternative" "conf.int"    "data.name"   "estimate"    "method"
#>  [6] "null.value"  "p.value"     "parameter"   "statistic"   "stderr"


Though we find that many items are reported, at the moment we are only 
interested in computing a 95% confidence set for the mean. 


t.test(sampledata)$"conf.int"
#> [1]  9.306651 12.871096
#> attr(,"conf.level")
#> [1] 0.95


This tells us that the 95% confidence interval is 


[9.31, 12.87] . 


In this example, the computed interval obviously does cover the true $\mu_Y$ which
we know to be 10.


Let us have a look at the whole standard output produced by t.test(). 


t.test(sampledata)
#> 
#>  One Sample t-test
#> 
#> data:  sampledata
#> t = 12.346, df = 99, p-value < 2.2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>   9.306651 12.871096
#> sample estimates:
#> mean of x 
#>  11.08887


We see that t.test() does not only compute a 95% confidence interval but
automatically conducts a two-sided significance test of the hypothesis $H_0: \mu_Y = 0$
at the level of 5% and reports relevant parameters thereof: the alternative
hypothesis, the estimated mean, the resulting t-statistic, the degrees of freedom of
the underlying $t$ distribution (t.test() does not perform the normal
approximation) and the corresponding p-value. This is very convenient!


In this example, we come to the conclusion that the population mean is significantly
different from 0 (which is correct) at the level of 5%, since $\mu_Y = 0$ is not
an element of the 95% confidence interval

$$0 \notin [9.31, 12.87].$$

We come to an equivalent result when using the p-value rejection rule since

$$p\text{-value} < 2.2 \cdot 10^{-16} < 0.05.$$
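
To see what t.test() does internally, the interval and the t-statistic can be rebuilt by hand. The sketch below regenerates the same data with the same seed; the names manual_ci and manual_t are our own. Note that the critical value comes from the $t$ distribution with $n - 1$ degrees of freedom rather than from the normal approximation.

```r
# rebuild the output of t.test() step by step
set.seed(1)
sampledata <- rnorm(100, 10, 10)

n <- length(sampledata)
se <- sd(sampledata) / sqrt(n)

# critical value from the t distribution with n - 1 degrees of freedom
crit <- qt(0.975, df = n - 1)

manual_ci <- mean(sampledata) + c(-1, 1) * crit * se   # 95% confidence interval
manual_t <- mean(sampledata) / se                      # t-statistic for H0: mu = 0

manual_ci
all.equal(manual_ci, as.numeric(t.test(sampledata)$conf.int))
```

Since qt(0.975, df = 99) is slightly larger than 1.96, the interval reported by t.test() is slightly wider than the one based on the normal approximation.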


3.5 Comparing Means from Different Populations


Suppose you are interested in the means of two different populations, denote
them $\mu_1$ and $\mu_2$. More specifically, you are interested whether these population
means are different from each other and plan to use a hypothesis test to verify
this on the basis of independent sample data from both populations. A suitable
pair of hypotheses is

$$H_0: \mu_1 - \mu_2 = d_0 \quad \text{vs.} \quad H_1: \mu_1 - \mu_2 \neq d_0 \tag{3.12}$$

where $d_0$ denotes the hypothesized difference in means (so $d_0 = 0$ when the
means are equal, under the null hypothesis). The book teaches us that $H_0$ can
be tested with the t-statistic

$$t = \frac{(\overline{Y}_1 - \overline{Y}_2) - d_0}{SE(\overline{Y}_1 - \overline{Y}_2)} \tag{3.13}$$

where

$$SE(\overline{Y}_1 - \overline{Y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \tag{3.14}$$

This is called a two sample t-test. For large $n_1$ and $n_2$, (3.13) is standard
normal under the null hypothesis. Analogously to the simple t-test we can
compute confidence intervals for the true difference in population means:


$$(\overline{Y}_1 - \overline{Y}_2) \pm 1.96 \times SE(\overline{Y}_1 - \overline{Y}_2)$$

is a 95% confidence interval for $d$. In R, hypotheses as in (3.12) can be tested
with t.test(), too. Note that t.test() chooses $d_0 = 0$ by default. This can
be changed by setting the argument mu accordingly.


The subsequent code chunk demonstrates how to perform a two sample t-test 
in R using simulated data. 


# set random seed 
set.seed(1) 


# draw data from two different populations with equal mean 
sample_pop1 <- rnorm(100, 10, 10) 
sample_pop2 <- rnorm(100, 10, 20) 


# perform a two sample t-test 
t.test(sample_pop1, sample_pop2)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  sample_pop1 and sample_pop2
#> t = 0.872, df = 140.52, p-value = 0.3847
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -2.338012  6.028083
#> sample estimates:
#> mean of x mean of y 
#> 11.088874  9.243838


We find that the two sample t-test does not reject the (true) null hypothesis
that $d_0 = 0$.
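
The reported t-statistic can be traced back to (3.13) and (3.14). The following sketch regenerates the same samples and computes the statistic by hand; the names d_0, se_diff, t_manual, p_normal and t_builtin are our own.

```r
# compute the two sample t-statistic by hand and compare it with t.test()
set.seed(1)
sample_pop1 <- rnorm(100, 10, 10)
sample_pop2 <- rnorm(100, 10, 20)

d_0 <- 0   # hypothesized difference in means

# standard error of the difference in means, see (3.14)
se_diff <- sqrt(var(sample_pop1) / length(sample_pop1) +
                var(sample_pop2) / length(sample_pop2))

# t-statistic, see (3.13)
t_manual <- (mean(sample_pop1) - mean(sample_pop2) - d_0) / se_diff

# p-value based on the large-sample normal approximation
p_normal <- 2 * pnorm(-abs(t_manual))

# the hypothesized difference is passed to t.test() via the argument mu
t_builtin <- unname(t.test(sample_pop1, sample_pop2, mu = d_0)$statistic)
all.equal(t_manual, t_builtin)
```

The t-statistic agrees with the one from t.test(). The p-value differs slightly because t.test() uses a $t$ distribution with approximated degrees of freedom instead of the standard normal distribution.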


3.6 An Application to the Gender Gap of Earnings

This section discusses how to reproduce the results presented in the box The 

Gender Gap of Earnings of College Graduates in the United States of the book. 


In order to reproduce Table 3.1 of the book you need to download the replication 
data which are hosted by Pearson and can be downloaded here. This file contains 
data that range from 1992 to 2008 and earnings are reported in prices of 2008.


There are several ways to import the .xlsx-files into R. Our suggestion is the
function read_excel() from the readxl package (Wickham and Bryan, 2019). 
The package is not part of R’s base version and has to be installed manually. 


# load the 'readxl' package
library(readxl)


You are now ready to import the dataset. Make sure you use the correct path 
to import the downloaded file! In our example, the file is saved in a subfolder 
of the working directory named data. If you are not sure what your current 
working directory is, use getwd(), see also ?getwd. This will give you the path 
that points to the place R is currently looking for files to work with. 
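
Before importing, it can help to check that the file is where R expects it. The following defensive sketch is our own addition; it only assumes the relative path data/cps_ch3.xlsx used in this section.

```r
# show the current working directory
getwd()

# relative path used in this section (adjust it if your file is stored elsewhere)
path <- file.path("data", "cps_ch3.xlsx")

# print a hint instead of failing later inside the import function
if (!file.exists(path)) {
  message("File not found: ", normalizePath(path, mustWork = FALSE))
}
```

If the message appears, either move the file into the data subfolder of the working directory or change path accordingly.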


# import the data into R 
cps <- read_excel(path = "data/cps_ch3.xlsx") 


Next, install and load the package dplyr (Wickham et al., 2020). This package
provides some handy functions that simplify data wrangling a lot. It makes use
of the %>% operator.


# load the 'dplyr' package
library(dplyr)


First, get an overview of the dataset. Next, use %>% and some functions from
the dplyr package to group the observations by gender and year and compute
descriptive statistics for both groups.


# get an overview of the data structure 
head(cps)
#> # A tibble: 6 x 3
#>   a_sex  year ahe08
#>   <dbl> <dbl> <dbl>
#> 1     1  1992  17.2
#> 2     1  1992  15.3
#> 3     1  1992  22.9
#> 4     2  1992  13.3
#> 5     1  1992  22.1
#> 6     2  1992  12.2


# group data by gender and year and compute the mean, standard deviation 
# and number of observations for each group 
avgs <- cps %>%
        group_by(a_sex, year) %>%
        summarise(mean(ahe08),
                  sd(ahe08),
                  n())


# print the results to the console 

print(avgs)
#> # A tibble: 10 x 5
#> # Groups:   a_sex [2]
#>    a_sex  year `mean(ahe08)` `sd(ahe08)` `n()`
#>    <dbl> <dbl>         <dbl>       <dbl> <int>
#>  1     1  1992          23.3       10.2   1594
#>  2     1  1996          22.5       10.1   1379
#>  3     1  2000          24.9       11.6   1303
#>  4     1  2004          25.1       12.0   1894
#>  5     1  2008          25.0       11.8   1838
#>  6     2  1992          20.0        7.87  1368
#>  7     2  1996          19.0        7.95  1230
#>  8     2  2000          20.7        9.36  1181
#>  9     2  2004          21.0        9.36  1735
#> 10     2  2008          20.9        9.66  1871


With the pipe operator %>% we simply chain different R functions that produce
compatible input and output. In the code above, we take the dataset cps and
use it as an input for the function group_by(). The output of group_by() is
subsequently used as an input for summarise() and so forth.
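
The following sketch contrasts nested function calls with the equivalent chain. It uses a small made-up data frame (toy, with the columns group and value) instead of the CPS data and assumes dplyr version 1.0.0 or later, since the argument .groups is not available in older versions.

```r
library(dplyr)

# small made-up data set, only for illustration (not the CPS data)
toy <- data.frame(group = c("a", "a", "b", "b"),
                  value = c(1, 3, 10, 20))

# nested function calls are read from the inside out
nested <- summarise(group_by(toy, group), avg = mean(value), .groups = "drop")

# the same computation written as a chain with the pipe operator
piped <- toy %>%
  group_by(group) %>%
  summarise(avg = mean(value), .groups = "drop")

all.equal(nested, piped)
```

Both versions produce the same result; the chained version can be read from top to bottom in the order in which the steps are carried out.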


Now that we have computed the statistics of interest for both genders, we can 
investigate how the gap in earnings between both groups evolves over time. 


# split the dataset by gender 
male <- avgs %>% dplyr::filter(a_sex == 1)

female <- avgs %>% dplyr::filter(a_sex == 2)

# rename columns of both splits
colnames(male)   <- c("Sex", "Year", "Y_bar_m", "s_m", "n_m")
colnames(female) <- c("Sex", "Year", "Y_bar_f", "s_f", "n_f")

# estimate gender gaps, compute standard errors and confidence intervals for all dates 
gap <- male$Y_bar_m - female$Y_bar_f 


gap_se <- sqrt(male$s_m^2 / male$n_m + female$s_f^2 / female$n_f)

gap_ci_l <- gap - 1.96 * gap_se

gap_ci_u <- gap + 1.96 * gap_se

result <- cbind(male[,-1], female[,-(1:2)], gap, gap_se, gap_ci_l, gap_ci_u)

# print the results to the console
print(result, digits = 3)
#>   Year Y_bar_m  s_m  n_m Y_bar_f  s_f  n_f  gap gap_se gap_ci_l gap_ci_u
#> 1 1992    23.3 10.2 1594    20.0 7.87 1368 3.23  0.332     2.58     3.88
#> 2 1996    22.5 10.1 1379    19.0 7.95 1230 3.49  0.354     2.80     4.19
#> 3 2000    24.9 11.6 1303    20.7 9.36 1181 4.14  0.421     3.32     4.97
#> 4 2004    25.1 12.0 1894    21.0 9.36 1735 4.10  0.356     3.40     4.80
#> 5 2008    25.0 11.8 1838    20.9 9.66 1871 4.10  0.354     3.41     4.80

We observe virtually the same results as the ones presented in the book. The 
computed statistics suggest that there is a gender gap in earnings. Note that 
we can reject the null hypothesis that the gap is zero for all periods. Further, 
estimates of the gap and bounds of the 95% confidence intervals indicate that 
the gap has been quite stable in the recent past. 
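
The rejection of the null hypothesis of a zero gap can be made explicit with t-statistics. The sketch below is our own addition and works with the rounded estimates of the gap and its standard error as we read them from the output above, so the results are approximate.

```r
# rounded estimates as read from the printed results (approximate values)
year   <- c(1992, 1996, 2000, 2004, 2008)
gap    <- c(3.23, 3.49, 4.14, 4.10, 4.10)
gap_se <- c(0.332, 0.354, 0.421, 0.356, 0.354)

# t-statistics and two-sided p-values for H0: gap = 0 (normal approximation)
t_gap <- gap / gap_se
p_gap <- 2 * pnorm(-abs(t_gap))

data.frame(year, t_gap = round(t_gap, 2), reject_5pct = abs(t_gap) > 1.96)
```

All t-statistics are far above the critical value of 1.96, which corresponds to the observation that none of the 95% confidence intervals contains zero.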


3.7 Scatterplots, Sample Covariance and Sample Correlation


A scatter plot represents two dimensional data, for example $n$ observations on
$X_i$ and $Y_i$, by points in a coordinate system. It is very easy to generate scatter
plots using the plot() function in R. Let us generate some artificial data on age
and earnings of workers and plot it.


# set random seed 
set.seed(123) 


# generate dataset 

X <- runif(n = 100, 
min = 18, 
max = 70) 


Y <- X + rnorm(n=100, 50, 15) 


# plot observations 


plot(X,
     Y,
     type = "p",
     main = "A Scatterplot of X and Y",
     xlab = "Age",
     ylab = "Earnings")


The plot shows positive correlation between age and earnings. This is in line
with the notion that older workers earn more than those who joined the working
population recently.


Sample Covariance and Correlation 


By now you should be familiar with the concepts of variance and covariance. If 
not, we recommend you to work your way through Chapter 2 of the book. 


Just like the variance, covariance and correlation of two variables are properties
that relate to the (unknown) joint probability distribution of these variables.
We can estimate covariance and correlation by means of suitable estimators
using a sample $(X_i, Y_i)$, $i = 1, \dots, n$.


The sample covariance

$$s_{XY} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})$$

is an estimator for the population covariance of $X$ and $Y$ whereas the sample
correlation

$$r_{XY} = \frac{s_{XY}}{s_X s_Y}$$

can be used to estimate the population correlation, a standardized measure for
the strength of the linear relationship between $X$ and $Y$. See Chapter 3.7 in
the book for a more detailed treatment of these estimators.


As for variance and standard deviation, these estimators are implemented as
R functions in the stats package. We can use them to estimate population
covariance and population correlation of the artificial data on age and earnings.


# compute sample covariance of X and Y 
cov(X, Y) 
#> [1] 213.934 


# compute sample correlation between X and Y 
cor(X, Y) 
#> [1] 0.706372 


# an equivalent way to compute the sample correlation 
cov(X, Y) / (sd(X) * sd(Y))
#> [1] 0.706372 


The estimates indicate that X and Y are moderately correlated. 
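
To connect the functions cov() and cor() with the formulas for the sample covariance and the sample correlation, both estimators can be computed by hand. The sketch regenerates the artificial data with the same seed; the names s_xy and r_xy are our own.

```r
# regenerate the artificial data on age and earnings
set.seed(123)
X <- runif(n = 100, min = 18, max = 70)
Y <- X + rnorm(n = 100, 50, 15)

n <- length(X)

# sample covariance: sum of cross products of deviations divided by n - 1
s_xy <- sum((X - mean(X)) * (Y - mean(Y))) / (n - 1)

# sample correlation: covariance standardized by both standard deviations
r_xy <- s_xy / (sd(X) * sd(Y))

all.equal(s_xy, cov(X, Y))
all.equal(r_xy, cor(X, Y))
```

Both comparisons should return TRUE, which confirms that cov() and cor() divide by $n - 1$.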


The next code chunk uses the function mvrnorm() from package MASS (Ripley,
2020) to generate bivariate sample data with different degrees of correlation.


library(MASS)


# set random seed 
set.seed(1) 


# positive correlation (0.81) 

example1 <- mvrnorm(100,
                    mu = c(0, 0),
                    Sigma = matrix(c(2, 2, 2, 3), ncol = 2),
                    empirical = TRUE)


# negative correlation (-0.81) 

example2 <- mvrnorm(100, 
mu = c(0, 0), 
Sigma = matrix(c(2, -2, -2, 3), ncol = 2), 
empirical = TRUE) 


# no correlation 

example3 <- mvrnorm(100, 
mu = c(0, 0), 
Sigma = matrix(c(1, 0, 0, 1), ncol = 2), 
empirical = TRUE) 


# no correlation (quadratic relationship) 
X <- seq(-3, 3, 0.01)
Y <- - X^2 + rnorm(length(X))
example4 <- cbind(X, Y) 


# divide plot area as 2-by-2 array 
par(mfrow = c(2, 2)) 


# plot datasets 
plot(example1, col = "steelblue", pch = 20, xlab = "X", ylab = "Y",
     main = "Correlation = 0.81")

plot(example2, col = "steelblue", pch = 20, xlab = "X", ylab = "Y",
     main = "Correlation = -0.81")

plot(example3, col = "steelblue", pch = 20, xlab = "X", ylab = "Y",
     main = "Correlation = 0")

plot(example4, col = "steelblue", pch = 20, xlab = "X", ylab = "Y",
     main = "Correlation = 0")


[Figure: four scatterplots in a 2-by-2 array with the panel titles Correlation = 0.81, Correlation = -0.81, Correlation = 0 and Correlation = 0.]


3.8 Exercises


This interactive part of the book is only available in the HTML version.




Chapter 4 


Linear Regression with One 
Regressor 


This chapter introduces the basics in linear regression and shows how to perform 
regression analysis in R. In linear regression, the aim is to model the relationship 
between a dependent variable Y and one or more explanatory variables denoted 
by X1, X2,..., Xp. Following the book we will focus on the concept of simple 
linear regression throughout the whole chapter. In simple linear regression, 
there is just one explanatory variable X1. If, for example, a school cuts its class
sizes by hiring new teachers, that is, the school lowers X1, the student-teacher 
ratios of its classes, how would this affect Y, the performance of the students 
involved in a standardized test? With linear regression we can not only examine 
whether the student-teacher ratio does have an impact on the test results but 
we can also learn about the direction and the strength of this effect. 


The following packages are needed for reproducing the code presented in this 
chapter: 


• AER - accompanies the book Applied Econometrics with R (Kleiber and
Zeileis, 2008) and provides useful functions and data sets.


• MASS - a collection of functions for applied statistics.


Make sure these are installed before you go ahead and try to replicate the 
examples. The safest way to do so is by checking whether the following code 
chunk executes without any errors. 


library(AER)
library(MASS)


4.1 Simple Linear Regression 


To start with an easy example, consider the following combinations of average 
test score and the average student-teacher ratio in some fictional school districts. 





              1    2    3    4    5      6    7
TestScore   680  640  670  660  630  660.0  635
STR          15   17   19   20   22   23.5   25


To work with these data in R we begin by generating two vectors: one for the 
student-teacher ratios (STR) and one for test scores (TestScore), both contain- 
ing the data from the table above. 


# Create sample data
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
TestScore <- c(680, 640, 670, 660, 630, 660, 635)

# Print out sample data
STR
#> [1] 15.0 17.0 19.0 20.0 22.0 23.5 25.0
TestScore
#> [1] 680 640 670 660 630 660 635


To build a simple linear regression model, we hypothesize that the relationship
between the dependent and the independent variable is linear, formally:

$$Y = b \cdot X + a.$$


For now, let us suppose that the function which relates test score and student- 
teacher ratio to each other is 


$$TestScore = 713 - 3 \times STR.$$


It is always a good idea to visualize the data you work with. Here, it is suitable to
use plot() to produce a scatterplot with STR on the x-axis and TestScore on
the y-axis. Just call plot(y_variable ~ x_variable) whereby y_variable
and x_variable are placeholders for the vectors of observations we want to
plot. Furthermore, we might want to add a systematic relationship to the plot.
To draw a straight line, R provides the function abline(). We just have to call
this function with arguments a (representing the intercept) and b (representing
the slope) after executing plot() in order to add the line to our plot.


The following code reproduces Figure 4.1 from the textbook.


# create a scatterplot of the data
plot(TestScore ~ STR)

# add the systematic relationship to the plot
abline(a = 713, b = -3)


We find that the line does not touch any of the points although we claimed that
it represents the systematic relationship. The reason for this is randomness.
Most of the time there are additional influences which imply that there is no
exact bivariate relationship between the two variables.


In order to account for these differences between observed data and the system- 
atic relationship, we extend our model from above by an error term u which 
captures additional random effects. Put differently, u accounts for all the dif- 
ferences between the regression line and the actual observed data. Beside pure 
randomness, these deviations could also arise from measurement errors or, as 
will be discussed later, could be the consequence of leaving out other factors 
that are relevant in explaining the dependent variable. 
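
For the fictional data from above, these deviations can be computed directly. The sketch below takes the assumed relationship $TestScore = 713 - 3 \times STR$ as given; the names line_values and u are our own.

```r
# fictional data from the table above
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
TestScore <- c(680, 640, 670, 660, 630, 660, 635)

# values on the assumed systematic relationship TestScore = 713 - 3 * STR
line_values <- 713 - 3 * STR

# deviations of the observed test scores from the line
u <- TestScore - line_values
u
```

None of the deviations is zero, which is why the line does not pass through any of the observations.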


Which other factors are plausible in our example? For one thing, the test scores 
might be driven by the teachers’ quality and the background of the students. It 
is also possible that in some classes, the students were lucky on the test days 
and thus achieved higher scores. For now, we will summarize such influences by 
an additive component: 


$$TestScore = \beta_0 + \beta_1 \times STR + \text{other factors}$$

Of course this idea is very general as it can be easily extended to other situations
that can be described with a linear model. The basic linear regression model
we will work with hence is

$$Y_i = \beta_0 + \beta_1 X_i + u_i.$$


Key Concept 4.1 summarizes the terminology of the simple linear regression
model.


Key Concept 4.1 
Terminology for the Linear Regression Model with a Single 
Regressor 


The linear regression model is

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

where

the index $i$ runs over the observations, $i = 1, \dots, n$

$Y_i$ is the dependent variable, the regressand, or simply the left-hand
variable

$X_i$ is the independent variable, the regressor, or simply the right-hand
variable

$Y = \beta_0 + \beta_1 X$ is the population regression line also called the
population regression function

$\beta_0$ is the intercept of the population regression line

$\beta_1$ is the slope of the population regression line

$u_i$ is the error term.
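
The terminology can be illustrated with simulated data. In the following sketch all numbers are assumptions chosen for illustration (the parameters $\beta_0 = 713$ and $\beta_1 = -3$ are borrowed from the fictional example above, the distributions of the regressor and the error term are our own choices); it is not an estimate based on real data.

```r
set.seed(1)

# assumed population parameters (illustrative values)
beta_0 <- 713
beta_1 <- -3
n <- 50

X <- runif(n, min = 15, max = 25)   # regressor
u <- rnorm(n, mean = 0, sd = 10)    # error term
Y <- beta_0 + beta_1 * X + u        # regressand

# the error term is the deviation of Y from the population regression line
all.equal(Y - (beta_0 + beta_1 * X), u)
```

In practice only X and Y are observed, while the parameters and the error term are unknown.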





4.2 Estimating the Coefficients of the Linear 
Regression Model 


In practice, the intercept $\beta_0$ and slope $\beta_1$ of the population regression line are
unknown. Therefore, we must employ data to estimate both unknown parameters.
In the following, a real world example will be used to demonstrate how this
is achieved. We want to relate test scores to student-teacher ratios measured in 
Californian schools. The test score is the district-wide average of reading and 
math scores for fifth graders. Again, the class size is measured as the number of 
students divided by the number of teachers (the student-teacher ratio). As for 
the data, the California School data set (CASchools) comes with an R package 
called AER, an acronym for Applied Econometrics with R (Kleiber and Zeileis, 
2020). After installing the package with install.packages("AER") and at- 
taching it with library(AER) the data set can be loaded using the function 
data().


## # install the AER package (once)
## install.packages("AER")
##
## # load the AER package
library(AER)

# load the data set in the workspace
data(CASchools)


Once a package has been installed it is available for use at further occasions
when invoked with library(); there is no need to run install.packages()
again!


It is interesting to know what kind of object we are dealing with. class()
returns the class of an object. Depending on the class of an object, some
functions (for example plot() and summary()) behave differently.


Let us check the class of the object CASchools. 


class(CASchools)
#> [1] "data.frame"


It turns out that CASchools is of class data.frame, which is a convenient format
to work with, especially for performing regression analysis.


With the help of head() we get a first overview of our data. This function shows
only the first 6 rows of the data set, which prevents an overcrowded console
output.


Press ctrl + L to clear the console. This command deletes any code
that has been typed in and executed by you or printed to the console
by R functions. The good news is that anything else is left untouched.
You lose neither defined variables nor the code history. It is still
possible to recall previously executed R commands using the up and
down keys. If you are working in RStudio, press ctrl + Up on your
keyboard (CMD + Up on a Mac) to review a list of previously entered
commands.

head(CASchools)
#>   district                          school  county grades students teachers
#> 1    75119              Sunol Glen Unified Alameda  KK-08      195    10.90
#> 2    61499            Manzanita Elementary   Butte  KK-08      240    11.15
#> 3    61549     Thermalito Union Elementary   Butte  KK-08     1550    82.90
#> 4    61457 Golden Feather Union Elementary   Butte  KK-08      243    14.00
#> 5    61523        Palermo Union Elementary   Butte  KK-08     1335    71.50
#> 6    62042         Burrel Union Elementary  Fresno  KK-08      137     6.40
#>   calworks   lunch computer expenditure    income   english  read  math
#> 1   0.5102  2.0408       67    6384.911 22.690001  0.000000 691.6 690.0
#> 2  15.4167 47.9167      101    5099.381  9.824000  4.583333 660.5 661.9
#> 3  55.0323 76.3226      169    5501.955  8.978000 30.000002 636.3 650.9
#> 4  36.4754 77.0492       85    7101.831  8.978000  0.000000 651.9 643.5
#> 5  33.1086 78.4270      171    5235.988  9.080333 13.857677 641.8 639.9
#> 6  12.3188 86.9565       25    5580.147 10.415000 12.408759 605.7 605.4


We find that the data set consists of plenty of variables and that most of them 
are numeric. 


By the way: an alternative to class() and head() is str(), whose name is short
for 'structure' and which gives a comprehensive overview of the object. Try it!


Turning back to CASchools, the two variables we are interested in (i.e., average
test score and the student-teacher ratio) are not included. However, it is possible
to calculate both from the provided data. To obtain the student-teacher ratios,
we simply divide the number of students by the number of teachers. The average
test score is the arithmetic mean of the test score for reading and the score of the
math test. The next code chunk shows how the two variables can be constructed
as vectors and how they are appended to CASchools.


# compute STR and append it to CASchools 
CASchools$STR <- CASchools$students / CASchools$teachers


# compute TestScore and append it to CASchools
CASchools$score <- (CASchools$read + CASchools$math) / 2


If we ran head(CASchools) again we would find the two variables of interest as 
additional columns named STR and score (check this!). 
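
Both columns can also be added in a single step with transform(). The following is only a sketch: it uses a small made-up data frame (called toy here, with hypothetical values) that mimics the relevant columns of CASchools, so that it runs without the AER package.

```r
# toy data with the same column names as CASchools (values are made up)
toy <- data.frame(students = c(200, 300, 450),
                  teachers = c(10, 12, 25),
                  read     = c(650, 640, 660),
                  math     = c(654, 650, 662))

# add both derived columns at once
toy <- transform(toy,
                 STR   = students / teachers,
                 score = (read + math) / 2)

# print the extended data frame
toy
```

Inside transform() the existing columns can be referred to by name, so the $ operator is not needed.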


Table 4.1 from the textbook summarizes the distribution of test scores and 
student-teacher ratios. There are several functions which can be used to produce 
similar results, e.g., 


• mean() (computes the arithmetic mean of the provided numbers),
• sd() (computes the sample standard deviation),
• quantile() (returns a vector of the specified sample quantiles for the
  data).


The next code chunk shows how to achieve this. First, we compute summary 
statistics on the columns STR and score of CASchools. In order to get nice 
output we gather the measures in a data.frame named DistributionSummary. 
# compute sample averages of STR and score
avg_STR <- mean(CASchools$STR)
avg_score <- mean(CASchools$score)


# compute sample standard deviations of STR and score
sd_STR <- sd(CASchools$STR)
sd_score <- sd(CASchools$score)


# set up a vector of percentiles and compute the quantiles
quantiles <- c(0.10, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9)
quant_STR <- quantile(CASchools$STR, quantiles)
quant_score <- quantile(CASchools$score, quantiles)


# gather everything in a data.frame
DistributionSummary <- data.frame(Average = c(avg_STR, avg_score),
                                  StandardDeviation = c(sd_STR, sd_score),
                                  quantile = rbind(quant_STR, quant_score))


# print the summary to the console
DistributionSummary


#>               Average StandardDeviation quantile.10. quantile.25. quantile.40.
#> quant_STR    19.64043          1.891812      17.3486     18.58236     19.26618
#> quant_score 654.15655         19.053347     630.3950    640.05000    649.06999
#>             quantile.50. quantile.60. quantile.75. quantile.90.
#> quant_STR       19.72321      20.0783     20.87181     21.86741
#> quant_score    654.45000     659.4000    666.66249    678.85999


As for the sample data, we use plot(). This allows us to detect characteristics
of our data, such as outliers, which are harder to discover by looking at mere
numbers. This time we add some additional arguments to the call of plot().


The first argument in our call of plot(), score ~ STR, is again a formula that
states the variables on the y- and the x-axis. However, this time the two variables
are not saved in separate vectors but are columns of CASchools. Therefore, R
would not find them without the argument data being correctly specified. data
must match the name of the data.frame to which the variables belong, in this
case CASchools. Further arguments are used to change the appearance of the
plot: while main adds a title, xlab and ylab add custom labels to both axes.


plot(score ~ STR,
     data = CASchools,
     main = "Scatterplot of TestScore and STR",
     xlab = "STR (X)",
     ylab = "Test Score (Y)")


[Figure: Scatterplot of TestScore and STR]


The plot (Figure 4.2 in the book) shows the scatterplot of all observations on
the student-teacher ratio and test score. We see that the points are strongly
scattered, and that the variables are negatively correlated. That is, we expect
to observe lower test scores in bigger classes.


The function cor() (see ?cor for further info) can be used to compute the 
correlation between two numeric vectors. 


cor(CASchools$STR, CASchools$score) 
#> [1] -0.2263627 


As the scatterplot already suggests, the correlation is negative but rather weak. 
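
To see what cor() computes, the sample correlation can be rebuilt from its definition, the sample covariance divided by the product of the two sample standard deviations. The sketch below uses simulated vectors x and y (illustrative names, not part of CASchools).

```r
# simulate two negatively related variables
set.seed(1)
x <- rnorm(100)
y <- -0.5 * x + rnorm(100)

# sample correlation from its definition
r_manual <- cov(x, y) / (sd(x) * sd(y))

# compare with the built-in function
all.equal(r_manual, cor(x, y))
```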


The task we are now facing is to find a line which best fits the data. Of course 
we could simply stick with graphical inspection and correlation analysis and 
then select the best fitting line by eyeballing. However, this would be rather 
subjective: different observers would draw different regression lines. On this 
account, we are interested in techniques that are less arbitrary. Such a technique 
is given by ordinary least squares (OLS) estimation. 


The Ordinary Least Squares Estimator 


The OLS estimator chooses the regression coefficients such that the estimated
regression line is as "close" as possible to the observed data points. Here,
closeness is measured by the sum of the squared mistakes made in predicting $Y$
given $X$. Let $b_0$ and $b_1$ be some estimators of $\beta_0$ and $\beta_1$. Then the sum of
squared estimation mistakes can be expressed as


$$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2.$$

The OLS estimator in the simple regression model is the pair of estimators for
intercept and slope which minimizes the expression above. The derivation of
the OLS estimators for both parameters is presented in Appendix 4.1 of the
book. The results are summarized in Key Concept 4.2.
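
The minimization can also be carried out numerically, which makes the idea of "choosing the closest line" concrete. The following sketch uses simulated data and a helper function ssm() (a name chosen here for illustration). It hands the sum of squared mistakes to the general-purpose optimizer optim() and compares the result with the closed-form OLS solution returned by lm().

```r
# simulate data from a linear model
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(100)

# sum of squared mistakes for a candidate pair b = (b0, b1)
ssm <- function(b) sum((y - b[1] - b[2] * x)^2)

# minimize numerically, starting from (0, 0)
num_min <- optim(par = c(0, 0), fn = ssm, method = "BFGS")

# numerical minimizer next to the closed-form OLS solution
rbind(optim = num_min$par, lm = coef(lm(y ~ x)))
```

Both rows should agree up to the numerical tolerance of the optimizer. In practice the closed-form solution is preferable, since it is exact and faster.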


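Key Concept 4.2
The OLS Estimator, Predicted Values, and Residuals


The OLS estimators of the slope $\beta_1$ and the intercept $\beta_0$ in the simple
linear regression model are

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^{n} (X_i - \overline{X})^2},$$

$$\hat\beta_0 = \overline{Y} - \hat\beta_1 \overline{X}.$$

The OLS predicted values $\hat{Y}_i$ and residuals $\hat{u}_i$ are

$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i,$$

$$\hat{u}_i = Y_i - \hat{Y}_i.$$

The estimated intercept $\hat\beta_0$, the slope parameter $\hat\beta_1$ and the residuals
($\hat{u}_i$) are computed from a sample of $n$ observations of $X_i$ and $Y_i$,
$i = 1, \dots, n$. These are estimates of the unknown true population intercept
($\beta_0$), slope ($\beta_1$), and error term ($u_i$).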





There are many possible ways to compute $\hat\beta_0$ and $\hat\beta_1$ in R. For example, we could
implement the formulas presented in Key Concept 4.2 with two of R's most basic
functions: mean() and sum(). Before doing so we attach the CASchools dataset.


attach(CASchools) # allows to use the variables contained in CASchools directly


# compute beta_1_hat
beta_1 <- sum((STR - mean(STR)) * (score - mean(score))) / sum((STR - mean(STR))^2)


# compute beta_0_hat
beta_0 <- mean(score) - beta_1 * mean(STR)


# print the results to the console
beta_1
#> [1] -2.279808

beta_0
#> [1] 698.9329
Calling attach(CASchools) enables us to address a variable contained in
CASchools by its name: it is no longer necessary to use the $ operator
in conjunction with the dataset, since R can evaluate the variable name
directly.

If an object in the user environment shares its name with a variable
contained in an attached database, R uses the object in the user
environment. It is therefore better practice to always use distinctive
names in order to avoid such (seeming) ambiguities!

Notice that we address variables contained in the attached dataset
CASchools directly for the rest of this chapter!
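
A common alternative to attach() is with(), which evaluates a single expression inside a data frame and leaves the search path unchanged, so name clashes cannot arise. This is a minimal sketch on a made-up data frame (toy and its values are hypothetical); the chapter itself continues to rely on attach().

```r
# made-up data with the two columns used in the chapter
toy <- data.frame(STR   = c(18, 20, 22, 19, 21),
                  score = c(660, 655, 640, 662, 648))

# slope and intercept from Key Concept 4.2, evaluated inside toy
b1 <- with(toy, sum((STR - mean(STR)) * (score - mean(score))) /
                sum((STR - mean(STR))^2))
b0 <- with(toy, mean(score) - b1 * mean(STR))

# print intercept and slope
c(b0, b1)
```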


Of course, there are even more manual ways to perform these tasks. With OLS
being one of the most widely used estimation techniques, R already contains a
built-in function named lm() (linear model) which can be used to carry out
regression analysis.


The first argument of the function to be specified is, similar to plot(), the
regression formula with the basic syntax y ~ x where y is the dependent variable
and x the explanatory variable. The argument data determines the data set
to be used in the regression. We now revisit the example from the book where
the relationship between the test scores and the class sizes is analyzed. The
following code uses lm() to replicate the results presented in Figure 4.3 of the
book.


# estimate the model and assign the result to linear_model
linear_model <- lm(score ~ STR, data = CASchools)


# print the standard output of the estimated lm object to the console
linear_model
#> 
#> Call:
#> lm(formula = score ~ STR, data = CASchools)
#> 
#> Coefficients:
#> (Intercept)          STR  
#>      698.93        -2.28


Let us add the estimated regression line to the plot. This time we also enlarge 
the ranges of both axes by setting the arguments xlim and ylim. 


# plot the data
plot(score ~ STR,
     data = CASchools,
     main = "Scatterplot of TestScore and STR",
     xlab = "STR (X)",
     ylab = "Test Score (Y)",
     xlim = c(10, 30),
     ylim = c(600, 720))


# add the regression line
abline(linear_model)


[Figure: Scatterplot of TestScore and STR]


Did you notice that this time, we did not pass the intercept and slope parameters
to abline()? If you call abline() on an object of class lm which only contains a
single regressor, R draws the regression line automatically!


4.3 Measures of Fit 


After fitting a linear regression model, a natural question is how well the model
describes the data. Visually, this amounts to assessing whether the observations
are tightly clustered around the regression line. Both the coefficient of
determination and the standard error of the regression measure how well the OLS
regression line fits the data.


The Coefficient of Determination 


$R^2$, the coefficient of determination, is the fraction of the sample variance of
$Y_i$ that is explained by $X_i$. Mathematically, the $R^2$ can be written as the
ratio of the explained sum of squares to the total sum of squares. The explained
sum of squares (ESS) is the sum of squared deviations of the predicted values
$\hat{Y}_i$ from the average of the $Y_i$. The total sum of squares (TSS) is the sum
of squared deviations of the $Y_i$ from their average. Thus we have


$$ESS = \sum_{i=1}^{n} (\hat{Y}_i - \overline{Y})^2, \tag{4.1}$$

$$TSS = \sum_{i=1}^{n} (Y_i - \overline{Y})^2, \tag{4.2}$$

$$R^2 = \frac{ESS}{TSS}. \tag{4.3}$$


Since $TSS = ESS + SSR$ we can also write


$$R^2 = 1 - \frac{SSR}{TSS}$$


where SSR is the sum of squared residuals, a measure for the errors made when
predicting the $Y$ by $X$. The SSR is defined as


$$SSR = \sum_{i=1}^{n} \hat{u}_i^2.$$


$R^2$ lies between 0 and 1. It is easy to see that a perfect fit, i.e., no errors made
when fitting the regression line, implies $R^2 = 1$ since then we have $SSR = 0$.
On the contrary, if our estimated regression line does not explain any variation
in the $Y_i$, we have $ESS = 0$ and consequently $R^2 = 0$.
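
The decomposition of the total sum of squares can be checked numerically for any OLS fit that includes an intercept. The sketch below does so on simulated data (x, y and m are illustrative names), computing all three sums of squares from their definitions.

```r
# simulate data and fit a simple regression
set.seed(1)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)
m <- lm(y ~ x)

ESS <- sum((fitted(m) - mean(y))^2)   # explained sum of squares
TSS <- sum((y - mean(y))^2)           # total sum of squares
SSR <- sum(residuals(m)^2)            # sum of squared residuals

all.equal(TSS, ESS + SSR)             # the decomposition holds
all.equal(ESS / TSS, 1 - SSR / TSS)   # both expressions for R^2 agree
```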


The Standard Error of the Regression 


The Standard Error of the Regression (SER) is an estimator of the standard
deviation of the residuals $\hat{u}_i$. As such it measures the magnitude of a typical
deviation from the regression line, i.e., the magnitude of a typical residual.


$$SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2} \quad \text{where} \quad s_{\hat{u}}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{SSR}{n-2}$$


Remember that the $u_i$ are unobserved. This is why we use their estimated
counterparts, the residuals $\hat{u}_i$, instead. See Chapter 4.3 of the book for a more
detailed comment on the SER.
Application to the Test Score Data


Both measures of fit can be obtained by using the function summary() with an
lm object provided as the only argument. While the function lm() only prints
out the estimated coefficients to the console, summary() provides additional
predefined information such as the regression's $R^2$ and the SER.


mod_summary <- summary(linear_model)
mod_summary
#> 
#> Call:
#> lm(formula = score ~ STR, data = CASchools)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -47.727 -14.251   0.483  12.822  48.540 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
#> STR          -2.2798     0.4798  -4.751 2.78e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 18.58 on 418 degrees of freedom
#> Multiple R-squared:  0.05124, Adjusted R-squared:  0.04897 
#> F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06


The $R^2$ in the output is called Multiple R-squared and has a value of 0.051.
Hence, 5.1% of the variance of the dependent variable score is explained by the
explanatory variable STR. That is, the regression explains little of the variance
in score, and much of the variation in test scores remains unexplained (cf. Figure
4.3 of the book).


The SER is called Residual standard error and equals 18.58. The unit of the
SER is the same as the unit of the dependent variable. That is, on average
the actually achieved test score deviates from the regression line by 18.58
points.


Now, let us check whether summary() uses the same definitions for $R^2$ and SER
as we do when computing them manually.


# compute R^2 manually
SSR <- sum(mod_summary$residuals^2)
TSS <- sum((score - mean(score))^2)
R2 <- 1 - SSR/TSS


# print the value to the console
R2
#> [1] 0.05124009


# compute SER manually
n <- nrow(CASchools)
SER <- sqrt(SSR / (n-2))


# print the value to the console
SER
#> [1] 18.58097


We find that the results coincide. Note that the values printed by summary()
are rounded.
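
The unrounded values can be read directly from the object returned by summary(), which stores them in the components r.squared and sigma. The sketch below illustrates this with the data set cars that ships with R (used here only as a stand-in, so that the example runs without the AER package).

```r
# fit a simple regression on a built-in data set
m <- lm(dist ~ speed, data = cars)
s <- summary(m)

s$r.squared   # R^2 at full precision
s$sigma       # SER at full precision

# the SER from its definition, using n - 2 degrees of freedom
sqrt(sum(residuals(m)^2) / (nrow(cars) - 2))
```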


4.4 The Least Squares Assumptions 


OLS performs well under a broad variety of circumstances. However, there are
some assumptions which need to be satisfied in order to ensure that the
estimates are normally distributed in large samples (we discuss this in
Chapter 4.5).


Key Concept 4.3
The Least Squares Assumptions


$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \dots, n$$


where


1. The error term $u_i$ has conditional mean zero given $X_i$: $E(u_i|X_i) = 0$.


2. $(X_i, Y_i)$, $i = 1, \dots, n$ are independent and identically distributed
   (i.i.d.) draws from their joint distribution.


3. Large outliers are unlikely: $X_i$ and $Y_i$ have nonzero finite fourth
   moments.





Assumption 1: The Error Term has Conditional Mean of Zero


This means that no matter which value we choose for $X$, the error term $u$ must
not show any systematic pattern and must have a mean of 0. Consider the case
that, unconditionally, $E(u) = 0$, but for low and high values of $X$, the error
term tends to be positive and for midrange values of $X$ the error tends to be
negative. We can use R to construct such an example. To do so we generate
our own data using R's built-in random number generators.


We will use the following functions: 


• runif() - generates uniformly distributed random numbers
• rnorm() - generates normally distributed random numbers
• predict() - does predictions based on the results of model fitting
  functions like lm()
• lines() - adds line segments to an existing plot


We start by creating a vector containing values that are uniformly distributed on
the interval $[-5, 5]$. This can be done with the function runif(). We also need
to simulate the error term. For this we generate normally distributed random
numbers with a mean equal to 0 and a variance of 1 using rnorm(). The $Y$
values are obtained as a quadratic function of the $X$ values and the error.


After generating the data we estimate both a simple regression model and a
quadratic model that also includes the regressor $X^2$ (this is a multiple
regression model, see Chapter 6). Finally, we plot the simulated data and add the
estimated regression line of a simple regression model as well as the predictions
made with a quadratic model to compare the fit graphically.


# set a seed to make the results reproducible
set.seed(321)


# simulate the data
X <- runif(50, min = -5, max = 5)
u <- rnorm(50, sd = 1)


# the true relation
Y <- X^2 + 2 * X + u


# estimate a simple regression model
mod_simple <- lm(Y ~ X)


# predict using a quadratic model
prediction <- predict(lm(Y ~ X + I(X^2)), data.frame(X = sort(X)))


# plot the results
plot(Y ~ X)
abline(mod_simple, col = "red")
lines(sort(X), prediction)
The plot shows what is meant by $E(u_i|X_i) = 0$ and why it does not hold for
the linear model:


Using the quadratic model (represented by the black curve) we see that there
are no systematic deviations of the observations from the predicted relation. It
is credible that the assumption is not violated when such a model is employed.
However, using a simple linear regression model we see that the assumption is
probably violated as $E(u_i|X_i)$ varies with the $X_i$.
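
The graphical impression can be backed up with a rough numerical check: if the conditional mean of the error is zero, the residuals should average to roughly zero within any range of $X$. The sketch below is an illustration under the same data-generating process, but with a larger sample (500 instead of 50 observations, an assumption made here to get stable averages). It compares average residuals in three bins of $X$ for both models.

```r
# simulate from the quadratic relation used above
set.seed(321)
X <- runif(500, min = -5, max = 5)
Y <- X^2 + 2 * X + rnorm(500)

# three equally wide bins of X: low, middle, high
bins <- cut(X, breaks = c(-5, -5/3, 5/3, 5), labels = c("low", "mid", "high"))

# average residual per bin for the linear and the quadratic model
res_lin  <- tapply(residuals(lm(Y ~ X)), bins, mean)
res_quad <- tapply(residuals(lm(Y ~ X + I(X^2))), bins, mean)

round(rbind(linear = res_lin, quadratic = res_quad), 2)
```

For the linear model the average residual should be clearly positive in the outer bins and clearly negative in the middle bin, whereas the averages for the quadratic model should all be close to zero.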


Assumption 2: Independently and Identically Distributed 
Data 


Most sampling schemes used when collecting data from populations produce
i.i.d. samples. For example, we could use R's random number generator to
randomly select student IDs from a university's enrollment list and record age $X$
and earnings $Y$ of the corresponding students. This is a typical example of
simple random sampling and ensures that all the $(X_i, Y_i)$ are drawn randomly
from the same population.


A prominent example where the i.i.d. assumption is not fulfilled is time series 
data where we have observations on the same unit over time. For example, 
take X as the number of workers in a production company over time. Due to 
business transformations, the company cuts jobs periodically by a specific share 
but there are also some non-deterministic influences that relate to economics, 
politics etc. Using R we can easily simulate such a process and plot it. 


We start the series with a total of 5000 workers and simulate the reduction of
employment with an autoregressive process that exhibits a downward movement
in the long run and has normally distributed errors:¹


$$employment_t = -50 + 0.98 \cdot employment_{t-1} + u_t$$


¹ See Chapter 14 for more on autoregressive processes and time series analysis in general.
# set seed
set.seed(123)


# generate a date vector
Date <- seq(as.Date("1951/1/1"), as.Date("2000/1/1"), "years")


# initialize the employment vector
X <- c(5000, rep(NA, length(Date)-1))


# generate time series observations with random influences
for (i in 2:length(Date)) {

    X[i] <- -50 + 0.98 * X[i-1] + rnorm(n = 1, sd = 200)

}


# plot the results
plot(x = Date,
     y = X,
     type = "l",
     col = "steelblue",
     ylab = "Workers",
     xlab = "Time")
It is evident that the observations on the number of employees cannot be
independent in this example: the level of today's employment is correlated with
tomorrow's employment level. Thus, the i.i.d. assumption is violated.
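
The dependence can be quantified by correlating the series with its own first lag. The sketch below regenerates a series of 50 observations from the same autoregressive process (n_periods is a name introduced here for illustration) and computes this lag-one correlation.

```r
# simulate the employment series
set.seed(123)
n_periods <- 50
X <- c(5000, rep(NA, n_periods - 1))
for (i in 2:n_periods) {
  X[i] <- -50 + 0.98 * X[i - 1] + rnorm(n = 1, sd = 200)
}

# correlation between employment in period t and in period t - 1
cor(X[-1], X[-n_periods])
```

For i.i.d. data this correlation would be close to zero; for the simulated series it should be strongly positive.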
Assumption 3: Large Outliers are Unlikely


It is easy to come up with situations where extreme observations, i.e.,
observations that deviate considerably from the usual range of the data, may
occur. Such observations are called outliers. Technically speaking, assumption 3
requires that $X$ and $Y$ have a finite kurtosis.²


Common cases where we want to exclude or (if possible) correct such outliers are
those where they are apparently typos, conversion errors or measurement errors.
Even if it seems like extreme observations have been recorded correctly, it is
advisable to exclude them before estimating a model since OLS suffers from
sensitivity to outliers.


What does this mean? One can show that extreme observations receive heavy
weighting in the estimation of the unknown regression coefficients when using
OLS. Therefore, outliers can lead to strongly distorted estimates of regression
coefficients. To get a better impression of this issue, consider the following
application where we have placed some sample data on $X$ and $Y$ which are
highly correlated. The relation between $X$ and $Y$ seems to be explained pretty
well by the plotted regression line: all of the white data points lie close to the
red regression line and we have $R^2 = 0.92$.


Now go ahead and add a further observation at, say, (18, 2). This observation
clearly is an outlier. The result is quite striking: the estimated regression line
differs greatly from the one we adjudged to fit the data well. The slope is
heavily downward biased and $R^2$ decreased to a mere 29%! Double-click inside
the coordinate system to reset the app. Feel free to experiment. Choose different
coordinates for the outlier or add additional ones.


The following code roughly reproduces what is shown in Figure 4.5 in the book.
As done above we use sample data generated using R's random number functions
rnorm() and runif(). We estimate two simple regression models, one based
on the original data set and another using a modified set where one observation
is changed to be an outlier, and then plot the results. In order to understand the
complete code you should be familiar with the function sort(), which sorts the
entries of a numeric vector in ascending order.


# set seed
set.seed(123)


# generate the data
X <- sort(runif(10, min = 30, max = 70))
Y <- rnorm(10 , mean = 200, sd = 50)
Y[9] <- 2000


# fit model with outlier
fit <- lm(Y ~ X)


# fit model without outlier
fitWithoutOutlier <- lm(Y[-9] ~ X[-9])


# plot the results
plot(Y ~ X)
abline(fit)
abline(fitWithoutOutlier, col = "red")


² See Chapter 4.4 of the book.
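
The effect of the single outlier can also be read off the coefficients. The sketch below regenerates the same data and places the estimates of both fits side by side (coef_with and coef_without are names chosen here for illustration).

```r
# regenerate the data with one outlier
set.seed(123)
X <- sort(runif(10, min = 30, max = 70))
Y <- rnorm(10, mean = 200, sd = 50)
Y[9] <- 2000

# estimates with and without the ninth observation
coef_with    <- coef(lm(Y ~ X))
coef_without <- coef(lm(Y[-9] ~ X[-9]))

# one row per fit: intercept and slope
rbind(with_outlier = coef_with, without_outlier = unname(coef_without))
```

Since the outlier lies far above the other observations and to the right of the mean of X, it pulls the estimated slope upwards.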
4.5 The Sampling Distribution of the OLS Estimator


Because $\hat\beta_0$ and $\hat\beta_1$ are computed from a sample, the estimators themselves
are random variables with a probability distribution, the so-called sampling
distribution of the estimators, which describes the values they could take on
over different samples. Although the sampling distribution of $\hat\beta_0$ and $\hat\beta_1$ can
be complicated when the sample size is small and generally changes with the
number of observations, $n$, it is possible, provided the assumptions discussed in
the book are valid, to make certain statements about it that hold for all $n$. In
particular


$$E(\hat\beta_0) = \beta_0 \quad \text{and} \quad E(\hat\beta_1) = \beta_1,$$


that is, $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of $\beta_0$ and $\beta_1$, the true parameters. If
the sample is sufficiently large, by the central limit theorem the joint sampling
distribution of the estimators is well approximated by the bivariate normal
distribution (2.1). This implies that the marginal distributions are also normal
in large samples. Core facts on the large-sample distributions of $\hat\beta_0$ and $\hat\beta_1$ are
presented in Key Concept 4.4.
Key Concept 4.4
Large Sample Distribution of $\hat\beta_0$ and $\hat\beta_1$


If the least squares assumptions in Key Concept 4.3 hold, then in large
samples $\hat\beta_0$ and $\hat\beta_1$ have a joint normal sampling distribution. The large
sample normal distribution of $\hat\beta_1$ is $\mathcal{N}(\beta_1, \sigma^2_{\hat\beta_1})$, where the variance of the
distribution, $\sigma^2_{\hat\beta_1}$, is


$$\sigma^2_{\hat\beta_1} = \frac{1}{n} \frac{Var\left[(X_i - \mu_X) u_i\right]}{\left[Var(X_i)\right]^2}. \tag{4.4}$$


The large sample normal distribution of $\hat\beta_0$ is $\mathcal{N}(\beta_0, \sigma^2_{\hat\beta_0})$ with


$$\sigma^2_{\hat\beta_0} = \frac{1}{n} \frac{Var(H_i u_i)}{\left[E(H_i^2)\right]^2}, \quad \text{where} \quad H_i = 1 - \left[\frac{\mu_X}{E(X_i^2)}\right] X_i. \tag{4.5}$$


The interactive simulation below continuously generates random samples
$(X_i, Y_i)$ of 200 observations where $E(Y|X) = 100 + 3X$, estimates
a simple regression model, stores the estimate of the slope $\beta_1$ and
visualizes the distribution of the $\hat\beta_1$s observed so far using a histogram.
The idea here is that for a large number of $\hat\beta_1$s, the histogram gives a
good approximation of the sampling distribution of the estimator. By
decreasing the time between two sampling iterations, it becomes clear
that the shape of the histogram approaches the characteristic bell shape
of a normal distribution centered at the true slope of 3.


This interactive part of the book is only available in the HTML version.





Simulation Study 1 


Whether the statements of Key Concept 4.4 really hold can also be verified
using R. For this we first build our own population of 100000 observations
in total. To do this we need values for the independent variable $X$, for the
error term $u$, and for the parameters $\beta_0$ and $\beta_1$. With these combined in a
simple regression model, we compute the dependent variable $Y$. In our example
we generate the numbers $X_i$, $i = 1, \dots, 100000$ by drawing a random sample
from a uniform distribution on the interval $[0, 20]$. The realizations of the error
terms $u_i$ are drawn from a normal distribution with parameters $\mu = 0$
and $\sigma^2 = 100$ (note that rnorm() requires $\sigma$ as input for the argument sd, see
?rnorm). Furthermore we choose $\beta_0 = -2$ and $\beta_1 = 3.5$ so the true model is


$$Y_i = -2 + 3.5 \cdot X_i + u_i.$$


Finally, we store the results in a data.frame.
# simulate data
N <- 100000
X <- runif(N, min = 0, max = 20)
u <- rnorm(N, sd = 10)


# population regression
Y <- -2 + 3.5 * X + u
population <- data.frame(X, Y)


From now on we will consider the previously generated data as the true
population (which of course would be unknown in a real-world application, otherwise
there would be no reason to draw a random sample in the first place). The
knowledge about the true population and the true relationship between $Y$ and
$X$ can be used to verify the statements made in Key Concept 4.4.


First, let us calculate the true variances $\sigma^2_{\hat\beta_0}$ and $\sigma^2_{\hat\beta_1}$ for a randomly drawn
sample of size $n = 100$.


# set sample size
n <- 100


# compute the variance of beta_hat_0
H_i <- 1 - mean(X) / mean(X^2) * X
var_b0 <- var(H_i * u) / (n * mean(H_i^2)^2 )


# compute the variance of hat_beta_1
var_b1 <- var( ( X - mean(X) ) * u ) / (n * var(X)^2)


# print variances to the console
var_b0
#> [1] 4.045066

var_b1
#> [1] 0.03018694


Now let us assume that we do not know the true values of $\beta_0$ and $\beta_1$ and that
it is not possible to observe the whole population. However, we can observe a
random sample of $n$ observations. Then, it would not be possible to compute
the true parameters but we could obtain estimates of $\beta_0$ and $\beta_1$ from the sample
data using OLS. However, we know that these estimates are outcomes of
random variables themselves since the observations are randomly sampled from the
population. Key Concept 4.4 describes their distributions for large $n$. When
drawing a single sample of size $n$ it is not possible to make any statement about
these distributions. Things change if we repeat the sampling scheme many times
and compute the estimates for each sample: using this procedure we simulate
outcomes of the respective distributions.
To achieve this in R, we employ the following approach:


• We assign the number of repetitions, say 10000, to reps and then initialize
  a matrix fit where the estimates obtained in each sampling iteration shall
  be stored row-wise. Thus fit has to be a matrix of dimensions reps × 2.

• In the next step we draw reps random samples of size n from the
  population and obtain the OLS estimates for each sample. The results are
  stored as row entries in the outcome matrix fit. This is done using a for()
  loop.

• At last, we estimate variances of both estimators using the sampled
  outcomes and plot histograms of the latter. We also add a plot of the density
  functions belonging to the distributions that follow from Key Concept 4.4.
  The function bquote() is used to obtain math expressions in the titles and
  labels of both plots. See ?bquote.


# set repetitions and sample size
n <- 100
reps <- 10000


# initialize the matrix of outcomes
fit <- matrix(ncol = 2, nrow = reps)


# loop sampling and estimation of the coefficients
for (i in 1:reps){

  sample <- population[sample(1:N, n), ]
  fit[i, ] <- lm(Y ~ X, data = sample)$coefficients

}


# compute variance estimates using outcomes
var(fit[, 1])
#> [1] 4.186832

var(fit[, 2])
#> [1] 0.03096199


# divide plotting area as 1-by-2 array
par(mfrow = c(1, 2))


# plot histograms of beta_0 estimates
hist(fit[, 1],
     cex.main = 1,
     main = bquote(The ~ Distribution ~ of ~ 10000 ~ beta[0] ~ Estimates),
     xlab = bquote(hat(beta)[0]),
     freq = F)


# add true distribution to plot
curve(dnorm(x,
            -2,
            sqrt(var_b0)),
      add = T,
      col = "darkred")


# plot histograms of beta_hat_1
hist(fit[, 2],
     cex.main = 1,
     main = bquote(The ~ Distribution ~ of ~ 10000 ~ beta[1] ~ Estimates),
     xlab = bquote(hat(beta)[1]),
     freq = F)


# add true distribution to plot
curve(dnorm(x,
            3.5,
            sqrt(var_b1)),
      add = T,
      col = "darkred")
Our variance estimates support the statements made in Key Concept 4.4,
coming close to the theoretical values. The histograms suggest that the distributions
of the estimators can be well approximated by the respective theoretical normal
distributions stated in Key Concept 4.4.
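
The same comparison can be written more compactly with replicate(). The following is a self-contained sketch and not the code used for the figures: it assumes a smaller population (10000 observations) and fewer repetitions (2000) to keep the run time short, and it computes the slope from the covariance formula of Key Concept 4.2 instead of calling lm(). The helper draw_slope() is a name introduced here for illustration.

```r
set.seed(42)

# a smaller population than in the text, to keep the run time short
N <- 10000
X <- runif(N, min = 0, max = 20)
u <- rnorm(N, sd = 10)
Y <- -2 + 3.5 * X + u

n <- 100
reps <- 2000

# OLS slope for one random sample of size n
draw_slope <- function() {
  idx <- sample(N, n)
  cov(X[idx], Y[idx]) / var(X[idx])
}
slopes <- replicate(reps, draw_slope())

# simulated variance next to the large-sample variance from Key Concept 4.4
var_sim    <- var(slopes)
var_theory <- var((X - mean(X)) * u) / (n * var(X)^2)
c(simulated = var_sim, theoretical = var_theory)
```

The two numbers should be of similar size; they will not coincide exactly because the simulated value is itself an estimate based on a finite number of repetitions.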


Simulation Study 2 


A further result implied by Key Concept 4.4 is that both estimators are
consistent, i.e., they converge in probability to the true parameters we are
interested in. This is because they are asymptotically unbiased and their
variances converge to 0 as $n$ increases. We can check this by repeating the
simulation above for a sequence of increasing sample sizes. This means we no
longer assign a single sample size but a vector of sample sizes: n <- c(...).
Let us look at the distributions of $\hat\beta_1$. The idea here is to add an additional
call of for() to the code. This is done in order to loop over the vector of sample
sizes n. For each of the sample sizes we carry out the same simulation as before
but plot a density estimate for the outcomes of each iteration over n. Notice that
we have to change n to n[j] in the inner loop to ensure that the $j^{th}$ element of
n is used. In the simulation, we use sample sizes of 100, 250, 1000 and 3000.
Consequently we have a total of four distinct simulations using different sample
sizes.


# set seed for reproducibility 
set.seed(1) 


# set repetitions and the vector of sample sizes 
reps <- 1000 
n <- c(100, 250, 1000, 3000) 


# initialize the matrix of outcomes
fit <- matrix(ncol = 2, nrow = reps)

# divide the plot panel in a 2-by-2 array
par(mfrow = c(2, 2))

# loop sampling and plotting

# outer loop over n
for (j in 1:length(n)) {

  # inner loop: sampling and estimating of the coefficients
  for (i in 1:reps) {

    sample <- population[sample(1:N, n[j]), ]
    fit[i, ] <- lm(Y ~ X, data = sample)$coefficients

  }

  # draw density estimates
  plot(density(fit[, 2]), xlim = c(2.5, 4.5),
       col = j,
       main = paste("n=", n[j]),
       xlab = bquote(hat(beta)[1]))

}


We find that, as n increases, the distribution of $\hat\beta_1$ concentrates
around its mean, i.e., its variance decreases. Put differently, the likelihood
of observing estimates close to the true value of $\beta_1 = 3.5$ grows as we
increase the sample size. The same behavior can be observed if we analyze the
distribution of $\hat\beta_0$ instead.
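The shrinking spread can also be checked numerically. The following is a minimal sketch, not the code used for the plots: it assumes an artificial population generated as Y = -2 + 3.5 X + u, and the names `pop`, `N_pop` and `slope_sd` are introduced here for illustration only. It reports the standard deviation of the simulated slope estimates for two sample sizes.

```r
# illustrative sketch: spread of the slope estimates for two sample sizes
set.seed(1)

# assumed artificial population (hypothetical, for illustration only)
N_pop <- 100000
X <- runif(N_pop, min = 0, max = 20)
pop <- data.frame(X = X, Y = -2 + 3.5 * X + rnorm(N_pop, sd = 10))

# standard deviation of 'reps' simulated slope estimates for sample size n
slope_sd <- function(n, reps = 200) {
  est <- replicate(reps, {
    idx <- sample(N_pop, n)
    coef(lm(Y ~ X, data = pop[idx, ]))[2]
  })
  sd(est)
}

sd_small <- slope_sd(100)
sd_large <- slope_sd(1000)

# the estimates scatter less around the true slope when n is larger
c(n100 = sd_small, n1000 = sd_large)
```

Under these assumptions the second entry should be clearly smaller than the first, which is the numerical counterpart of the narrowing density estimates.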


Simulation Study 3 


Furthermore, (4.1) reveals that the variance of the OLS estimator for $\beta_1$
decreases as the variance of the $X_i$ increases. In other words, as we increase
the amount of information provided by the regressor, that is, increasing
$Var(X)$, which is used to estimate $\beta_1$, we become more confident that the
estimate is close to the true value (i.e., $Var(\hat\beta_1)$ decreases). We can
visualize this by reproducing Figure 4.6 from the book. To do this, we sample
observations $(X_i, Y_i)$, $i = 1, \dots, 100$ from a bivariate normal
distribution with

$$E(X) = E(Y) = 5,$$

$$Var(X) = Var(Y) = 5$$

and

$$Cov(X, Y) = 4.$$

Formally, this is written down as

$$\begin{pmatrix} X \\ Y \end{pmatrix} \overset{i.i.d.}{\sim} \mathcal{N} \left[ \begin{pmatrix} 5 \\ 5 \end{pmatrix}, \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix} \right]. \tag{4.3}$$


To carry out the random sampling, we make use of the function mvrnorm()
from the package MASS (Ripley, 2020) which allows us to draw random samples
from multivariate normal distributions, see ?mvrnorm. Next, we use subset()
to split the sample into two subsets such that the first set, set1, consists of
observations that fulfill the condition $|X - \overline{X}| > 1$ and the second
set, set2, includes the remainder of the sample. We then plot both sets and use
different colors to distinguish the observations.


# load the MASS package
library(MASS)

# set seed for reproducibility
set.seed(4)

# simulate bivariate normal data
bvndata <- mvrnorm(100,
                   mu = c(5, 5),
                   Sigma = cbind(c(5, 4), c(4, 5)))

# assign column names / convert to data.frame
colnames(bvndata) <- c("X", "Y")
bvndata <- as.data.frame(bvndata)


# subset the data 
set1 <- subset (bvndata, abs(mean(X) - X) > 1) 
set2 <- subset(bvndata, abs(mean(X) - X) <= 1) 


# plot both data sets
plot(set1,
     xlab = "X",
     ylab = "Y",
     pch = 19)

points(set2,
       col = "steelblue",
       pch = 19)

It is clear that observations that are close to the sample average of the $X_i$
have less variance than those that are farther away. Now, if we were to draw a
line as accurately as possible through either of the two sets, it is intuitive
that choosing the observations indicated by the black dots, i.e., using the set
of observations which has larger variance than the blue ones, would result in a
more precise line. Now, let us use OLS to estimate slope and intercept for both
sets of observations. We then plot the observations along with both regression
lines.


# estimate both regression lines
lm.set1 <- lm(Y ~ X, data = set1)
lm.set2 <- lm(Y ~ X, data = set2)

# plot observations
plot(set1, xlab = "X", ylab = "Y", pch = 19)
points(set2, col = "steelblue", pch = 19)

# add both lines to the plot
abline(lm.set1, col = "green")
abline(lm.set2, col = "red")

Evidently, the green regression line does far better in describing data sampled
from the bivariate normal distribution stated in (4.3) than the red line. This is
a nice example for demonstrating why we are interested in a high variance of
the regressor X: more variance in the $X_i$ means more information from which
the precision of the estimation benefits.
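The gain in precision can also be quantified by comparing the standard errors of the two slope estimates. The sketch below is an illustration under the distributional assumptions stated above; the object names `bvn`, `far` and `near` are hypothetical and chosen here so that the chunk runs on its own.

```r
# illustrative sketch: compare the precision of the two slope estimates
library(MASS)
set.seed(4)

# bivariate normal data with means 5, variances 5 and covariance 4
bvn <- as.data.frame(mvrnorm(100, mu = c(5, 5),
                             Sigma = cbind(c(5, 4), c(4, 5))))
colnames(bvn) <- c("X", "Y")

far  <- subset(bvn, abs(mean(X) - X) > 1)    # large variation in X
near <- subset(bvn, abs(mean(X) - X) <= 1)   # small variation in X

# standard errors of the slope estimates in both subsets
se_far  <- summary(lm(Y ~ X, data = far))$coefficients["X", "Std. Error"]
se_near <- summary(lm(Y ~ X, data = near))$coefficients["X", "Std. Error"]
c(far = se_far, near = se_near)
```

The subset with more variation in the regressor should yield the visibly smaller standard error.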


4.6 Exercises 


This interactive part of the book is only available in the HTML version. 


Chapter 5


Hypothesis Tests and Confidence Intervals in the Simple Linear Regression Model


This chapter continues our treatment of the simple linear regression model. The
following subsections discuss how we may use our knowledge about the sampling
distribution of the OLS estimator in order to make statements regarding its
uncertainty.


These subsections cover the following topics:


• Testing hypotheses regarding regression coefficients.
• Confidence intervals for regression coefficients.
• Regression when X is a dummy variable.
• Heteroskedasticity and homoskedasticity.


The packages AER (Kleiber and Zeileis, 2020) and scales (Wickham and Seidel,
2020) are required for reproduction of the code chunks presented throughout this
chapter. The package scales provides additional generic plot scaling methods.
Make sure both packages are installed before you proceed. The safest way to do
so is by checking whether the following code chunk executes without any errors.

library(AER)
library(scales)


5.1 Testing Two-Sided Hypotheses Concerning 
the Slope Coefficient 


Using the fact that $\hat\beta_1$ is approximately normally distributed in large
samples (see Key Concept 4.4), testing hypotheses about the true value $\beta_1$
can be done as in Chapter 3.2.


Key Concept 5.1
General Form of the t-Statistic


Remember from Chapter 3 that a general t-statistic has the form

$$t = \frac{\text{estimated value} - \text{hypothesized value}}{\text{standard error of the estimator}}.$$


Key Concept 5.2
Testing Hypotheses regarding $\beta_1$


For testing the hypothesis $H_0: \beta_1 = \beta_{1,0}$, we need to perform the
following steps:

1. Compute the standard error of $\hat\beta_1$, $SE(\hat\beta_1)$.

2. Compute the t-statistic

$$t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)}.$$

3. Given a two-sided alternative ($H_1: \beta_1 \neq \beta_{1,0}$) we reject at
the 5% level if $|t^{act}| > 1.96$ or, equivalently, if the p-value is less than
0.05. Recall the definition of the p-value:

$$\begin{aligned}
p\text{-value} &= Pr_{H_0} \left[ \left| \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)} \right| > \left| \frac{\hat\beta_1^{act} - \beta_{1,0}}{SE(\hat\beta_1)} \right| \right] \\
&= Pr_{H_0}(|t| > |t^{act}|) \\
&= 2 \cdot \Phi(-|t^{act}|)
\end{aligned}$$

The last transformation is due to the normal approximation for
large samples.





Consider again the OLS regression stored in linear_model from Chapter 4 that 
gave us the regression line 


$$\widehat{TestScore} = \underset{(9.47)}{698.9} - \underset{(0.49)}{2.28} \times STR, \quad R^2 = 0.051, \quad SER = 18.6.$$


Copy and execute the following code chunk if the above model object is not 
available in your working environment. 


# load the `CASchools` dataset 
data(CASchools) 


# add student-teacher ratio
CASchools$STR <- CASchools$students / CASchools$teachers

# add average test-score
CASchools$score <- (CASchools$read + CASchools$math) / 2


# estimate the model 
linear_model <- lm(score ~ STR, data = CASchools) 


For testing a hypothesis concerning the slope parameter (the coefficient on
STR), we need $SE(\hat\beta_1)$, the standard error of the respective point
estimator. As is common in the literature, standard errors are presented in
parentheses below the point estimates.


Key Concept 5.1 reveals that it is rather cumbersome to compute the standard 
error and thereby the t-statistic by hand. The question you should be asking 
yourself right now is: can we obtain these values with minimum effort using R? 
Yes, we can. Let us first use summary() to get a summary on the estimated 
coefficients in linear_model. 


Note: Throughout the textbook, robust standard errors are reported. We consider
it instructive to keep things simple at the beginning and thus start out with
simple examples that do not allow for robust inference. Standard errors that are
robust to heteroskedasticity are introduced in Chapter 5.4 where we also
demonstrate how they can be computed using R. A discussion of
heteroskedasticity-autocorrelation robust standard errors takes place in
Chapter 15.


# print the summary of the coefficients to the console 
summary(linear_model)$coefficients
#>               Estimate Std. Error   t value      Pr(>|t|)
#> (Intercept) 698.932949  9.4674911 73.824516 6.569846e-242
#> STR          -2.279808  0.4798255 -4.751327  2.783308e-06


The second column of the coefficients' summary reports $SE(\hat\beta_0)$ and
$SE(\hat\beta_1)$. Also, in the third column t value, we find t-statistics
$t^{act}$ suitable for tests of the separate hypotheses $H_0: \beta_0 = 0$ and
$H_0: \beta_1 = 0$. Furthermore, the output provides us with p-values
corresponding to both tests against the two-sided alternatives
$H_1: \beta_0 \neq 0$ and $H_1: \beta_1 \neq 0$, respectively, in the fourth
column of the table.
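The entries of the coefficient matrix can be addressed by name rather than by position, which makes the relation between its columns explicit. This is a minimal sketch that uses the built-in cars data as a stand-in so that it runs without further packages; the object names are chosen for illustration.

```r
# illustrative sketch using the built-in 'cars' data as a stand-in
mod <- lm(dist ~ speed, data = cars)
coefs <- summary(mod)$coefficients

# address entries by row and column name instead of position
est <- coefs["speed", "Estimate"]
se  <- coefs["speed", "Std. Error"]

# the reported t value is the estimate divided by its standard error
# (it refers to the null hypothesis that the coefficient is zero)
t_by_hand <- est / se
all.equal(t_by_hand, coefs["speed", "t value"])
```

The same indexing works for linear_model, with "STR" in place of "speed".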


Let us have a closer look at the test of 


$$H_0: \beta_1 = 0 \quad vs. \quad H_1: \beta_1 \neq 0.$$

We have

$$t^{act} = \frac{-2.279808 - 0}{0.4798255} \approx -4.75.$$

What does this tell us about the significance of the estimated coefficient? We
reject the null hypothesis at the 5% level of significance since
$|t^{act}| > 1.96$. That is, the observed test statistic falls into the
rejection region as $p\text{-value} = 2.78 \cdot 10^{-6} < 0.05$. We conclude
that the coefficient is significantly different from zero. In other words, we
reject the hypothesis that the class size has no influence on the students'
test scores at the 5% level.


Note that although the difference is negligible in the present case as we will 
see later, summary () does not perform the normal approximation but calculates 
p-values using the t-distribution instead. Generally, the degrees of freedom of 
the assumed t-distribution are determined in the following manner: 


$$DF = n - k - 1$$


where n is the number of observations used to estimate the model and k is the 
number of regressors, excluding the intercept. In our case, we have n = 420 
observations and the only regressor is STR so k = 1. The simplest way to 
determine the model degrees of freedom is 


# determine residual degrees of freedom 
linear_model$df.residual 
#> [1] 418 
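
The degrees-of-freedom rule can be verified on any fitted simple regression. A brief sketch, again with the built-in cars data as an assumed stand-in:

```r
# illustrative sketch: residual degrees of freedom in a simple regression
mod <- lm(dist ~ speed, data = cars)   # built-in data, used as a stand-in

n <- nrow(cars)   # number of observations
k <- 1            # number of regressors, excluding the intercept

df_by_hand <- n - k - 1
df_by_hand == mod$df.residual
```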


Hence, for the assumed sampling distribution of $\hat\beta_1$ we have

$$\hat\beta_1 \sim t_{418}$$


such that the p-value for a two-sided significance test can be obtained by exe- 
cuting the following code: 


2 * pt(-4.751327, df = 418) 
#> [1] 2.78331e-06 


The result is very close to the value provided by summary(). However, since n
is sufficiently large, one could just as well use the standard normal
distribution to compute the p-value:


2 * pnorm(-4.751327) 
#> [1] 2.02086e-06 


The difference is indeed negligible. These findings tell us that, if
$H_0: \beta_1 = 0$ is true and we were to repeat the whole process of gathering
observations and estimating the model, observing a
$|\hat\beta_1| \geq |-2.28|$ is very unlikely!
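
The two computations can be wrapped in a small helper. The function `two_sided_p` below is a hypothetical name introduced for this sketch; the numbers passed to it are the estimate, standard error and degrees of freedom reported for linear_model above.

```r
# illustrative sketch: two-sided p-value from estimate, standard error and df
two_sided_p <- function(estimate, se, df, hypothesized = 0) {
  t_act <- (estimate - hypothesized) / se
  c(t_act  = t_act,
    p_t    = 2 * pt(-abs(t_act), df = df),   # t distribution, as in summary()
    p_norm = 2 * pnorm(-abs(t_act)))         # large-sample normal approximation
}

# values taken from the coefficient summary of linear_model
res <- two_sided_p(estimate = -2.279808, se = 0.4798255, df = 418)
res
```

Both p-values are of the order of 10^-6, so the choice between the t distribution and the normal approximation does not affect the test decision here.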


Using R we may visualize how such a statement is made when using the normal
approximation. This reflects the principles depicted in Figure 5.1 in the book.


Do not let the following code chunk deter you: the code is somewhat longer
than the usual examples and looks unappealing but there is a lot of repetition
since color shadings and annotations are added on both tails of the normal
distribution. We recommend executing the code step by step in order to see
how the graph is augmented with the annotations.


# Plot the standard normal on the support [-6,6]
t <- seq(-6, 6, 0.01)

plot(x = t,
     y = dnorm(t, 0, 1),
     type = "l",
     col = "steelblue",
     lwd = 2,
     yaxs = "i",
     axes = F,
     ylab = "",
     main = expression("Calculating the p-value of a Two-sided Test when" ~ t^act ~ "=-4.75"),
     cex.lab = 0.7,
     cex.main = 1)

tact <- -4.75

axis(1, at = c(0, -1.96, 1.96, -tact, tact), cex.axis = 0.7)

# Shade the critical regions using polygon():

# critical region in left tail
polygon(x = c(-6, seq(-6, -1.96, 0.01), -1.96),
        y = c(0, dnorm(seq(-6, -1.96, 0.01)), 0),
        col = 'orange')

# critical region in right tail
polygon(x = c(1.96, seq(1.96, 6, 0.01), 6),
        y = c(0, dnorm(seq(1.96, 6, 0.01)), 0),
        col = 'orange')

# Add arrows and texts indicating critical regions and the p-value
arrows(-3.5, 0.2, -2.5, 0.02, length = 0.1)
arrows(3.5, 0.2, 2.5, 0.02, length = 0.1)

arrows(-5, 0.16, -4.75, 0, length = 0.1)
arrows(5, 0.16, 4.75, 0, length = 0.1)

text(-3.5, 0.22,
     labels = expression("0.025"~"="~over(alpha, 2)),
     cex = 0.7)
text(3.5, 0.22,
     labels = expression("0.025"~"="~over(alpha, 2)),
     cex = 0.7)

text(-5, 0.18,
     labels = expression(paste("-|",t[act],"|")),
     cex = 0.7)
text(5, 0.18,
     labels = expression(paste("|",t[act],"|")),
     cex = 0.7)

# Add ticks indicating critical values at the 0.05-level, t^act and -t^act
rug(c(-1.96, 1.96), ticksize = 0.145, lwd = 2, col = "darkred")
rug(c(-tact, tact), ticksize = -0.0451, lwd = 2, col = "darkgreen")


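[Figure: Calculating the p-value of a Two-sided Test when t^act = -4.75.
Standard normal density with the critical regions beyond -1.96 and 1.96 shaded
(0.025 = alpha/2 in each tail) and -|t^act| and |t^act| marked; axis ticks at
-4.75, -1.96, 0, 1.96 and 4.75.]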


The p-value is the area under the curve to the left of -4.75 plus the area under
the curve to the right of 4.75. As we already know from the calculations above,
this value is very small.


5.2 Confidence Intervals for Regression Coefficients


As we already know, estimates of the regression coefficients $\beta_0$ and
$\beta_1$ are subject to sampling uncertainty, see Chapter 4. Therefore, we will
never exactly estimate the true value of these parameters from sample data in an
empirical application. However, we may construct confidence intervals for the
intercept and the slope parameter.


A 95% confidence interval for $\beta_i$ has two equivalent definitions:


• The interval is the set of values for which a hypothesis test at the 5% level
  does not reject.

• The interval has a probability of 95% to contain the true value of $\beta_i$.
  So in 95% of all samples that could be drawn, the confidence interval will
  cover the true value of $\beta_i$.


We also say that the interval has a confidence level of 95%. The idea of the 
confidence interval is summarized in Key Concept 5.3. 


Key Concept 5.3 
A Confidence Interval for $\beta_i$


Imagine you could draw all possible random samples of given size. The
interval that contains the true value $\beta_i$ in 95% of all samples is given
by the expression

$$CI^{\beta_i}_{0.95} = \left[ \hat\beta_i - 1.96 \times SE(\hat\beta_i), \ \hat\beta_i + 1.96 \times SE(\hat\beta_i) \right].$$


Equivalently, this interval can be seen as the set of null hypotheses for 
which a 5% two-sided hypothesis test does not reject. 
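
The expression in Key Concept 5.3 translates directly into a small function. The helper `ci_normal` below is a hypothetical name used for this sketch; it is applied to the slope estimate and standard error of the test score regression from Section 5.1.

```r
# illustrative sketch: large-sample confidence interval from estimate and SE
ci_normal <- function(estimate, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
  c(lower = estimate - z * se, upper = estimate + z * se)
}

# slope estimate and standard error from the test score regression
ci_str <- ci_normal(estimate = -2.279808, se = 0.4798255)
ci_str

# equivalence with the test: is the hypothesized value 0 inside the interval?
0 >= ci_str["lower"] & 0 <= ci_str["upper"]
```

Since zero lies outside the interval, the corresponding two-sided test rejects at the 5% level, in line with the second statement of the Key Concept.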





Simulation Study: Confidence Intervals


To get a better understanding of confidence intervals we conduct another
simulation study. For now, assume that we have the following sample of n = 100
observations on a single variable Y where

$$Y_i \overset{i.i.d.}{\sim} \mathcal{N}(5, 25), \quad i = 1, \dots, 100.$$


# set seed for reproducibility
set.seed(4)

# generate and plot the sample data
Y <- rnorm(n = 100,
           mean = 5,
           sd = 5)

plot(Y,
     pch = 19)

We assume that the data is generated by the model

$$Y_i = \mu + \epsilon_i$$

where $\mu$ is an unknown constant and we know that
$\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, 25)$. In this model, the OLS
estimator for $\mu$ is given by

$$\hat\mu = \overline{Y} = \frac{1}{n} \sum_{i=1}^n Y_i,$$

i.e., the sample average of the $Y_i$. It further holds that

$$SE(\hat\mu) = \frac{\sigma_\epsilon}{\sqrt{n}} = \frac{5}{\sqrt{100}}$$

(see Chapter 2). A large-sample 95% confidence interval for $\mu$ is then given
by

$$CI^{\mu}_{0.95} = \left[ \hat\mu - 1.96 \times \frac{5}{\sqrt{100}}, \ \hat\mu + 1.96 \times \frac{5}{\sqrt{100}} \right]. \tag{5.1}$$
It is fairly easy to compute this interval in R by hand. The following code chunk
generates a one-row matrix with named columns containing the interval bounds:

cbind(CIlower = mean(Y) - 1.96 * 5 / 10, CIupper = mean(Y) + 1.96 * 5 / 10)
#>       CIlower  CIupper
#> [1,] 4.502625 6.462625


Knowing that $\mu = 5$ we see that, for our example data, the confidence
interval covers the true value.


As opposed to real world examples, we can use R to get a better understanding of
confidence intervals by repeatedly sampling data, estimating $\mu$ and computing
the confidence interval for $\mu$ as in (5.1).


The procedure is as follows:


• We initialize the vectors lower and upper in which the simulated interval
  limits are to be saved. We want to simulate 10000 intervals so both vectors
  are set to have this length.

• We use a for() loop to sample 100 observations from the $\mathcal{N}(5, 25)$
  distribution and compute $\hat\mu$ as well as the boundaries of the confidence
  interval in every iteration of the loop.

• At last we join lower and upper in a matrix.


# set seed 
set.seed(1) 


# initialize vectors of lower and upper interval boundaries 
lower <- numeric(10000) 
upper <- numeric(10000) 


# loop sampling / estimation / CI
for(i in 1:10000) {

  Y <- rnorm(100, mean = 5, sd = 5)
  lower[i] <- mean(Y) - 1.96 * 5 / 10
  upper[i] <- mean(Y) + 1.96 * 5 / 10

}

# join vectors of interval bounds in a matrix
CIs <- cbind(lower, upper)


According to Key Concept 5.3 we expect that the fraction of the 10000 simulated
intervals saved in the matrix CIs that contain the true value $\mu = 5$ should
be roughly 95%. We can easily check this using logical operators.

mean(CIs[, 1] <= 5 & 5 <= CIs[, 2])
#> [1] 0.9487


The simulation shows that the fraction of intervals covering $\mu = 5$, i.e.,
those intervals for which $H_0: \mu = 5$ cannot be rejected, is close to the
theoretical value of 95%.
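
The same experiment can be written without an explicit loop. The following sketch is an alternative formulation, not the code used above: it draws all samples at once into a matrix (object names are chosen for illustration) and treats the standard deviation of 5 as known, as in (5.1). Because the random numbers are consumed in a different order, the resulting fraction need not match the loop-based result digit for digit.

```r
# illustrative sketch: the coverage experiment without an explicit loop
set.seed(1)

reps <- 10000
n <- 100

# each row of the matrix holds one sample of size n from N(5, 25)
samples <- matrix(rnorm(reps * n, mean = 5, sd = 5), nrow = reps)
means <- rowMeans(samples)

# interval bounds for all samples at once (sigma = 5 is treated as known)
half_width <- 1.96 * 5 / sqrt(n)
coverage <- mean(means - half_width <= 5 & 5 <= means + half_width)
coverage
```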


Let us draw a plot of the first 100 simulated confidence intervals and indicate
those which do not cover the true value of $\mu$. We do this via horizontal
lines representing the confidence intervals on top of each other.


# identify intervals not covering mu
# (4 intervals out of 100)
ID <- which(!(CIs[1:100, 1] <= 5 & 5 <= CIs[1:100, 2]))

# initialize the plot
plot(0,
     xlim = c(3, 7),
     ylim = c(1, 100),
     ylab = "Sample",
     xlab = expression(mu),
     main = "Confidence Intervals")

# set up color vector
colors <- rep(gray(0.6), 100)
colors[ID] <- "red"

# draw reference line at mu=5
abline(v = 5, lty = 2)

# add horizontal bars representing the CIs
for(j in 1:100) {

  lines(c(CIs[j, 1], CIs[j, 2]),
        c(j, j),
        col = colors[j],
        lwd = 2)

}

[Figure: Confidence Intervals. The first 100 simulated confidence intervals
drawn as horizontal bars (vertical axis: Sample, horizontal axis: mu) with a
dashed reference line at mu = 5.]
For the first 100 samples, the true null hypothesis is rejected in four cases so
these intervals do not cover $\mu = 5$. We have indicated the intervals which
lead to a rejection of the null in red.


Let us now come back to the example of test scores and class sizes. The
regression model from Chapter 4 is stored in linear_model. An easy way to get
95% confidence intervals for $\beta_0$ and $\beta_1$, the coefficients on
(Intercept) and STR, is to use the function confint(). We only have to provide a
fitted model object as an input to this function. The confidence level is set to
95% by default but can be modified by setting the argument level, see ?confint.


# compute 95% confidence interval for coefficients in 'linear_model'
confint(linear_model)
#>                 2.5 %     97.5 %
#> (Intercept) 680.32312 717.542775
#> STR          -3.22298  -1.336636


Let us check if the calculation is done as we expect it to be for $\beta_1$, the
coefficient on STR.

# compute 95% confidence interval for coefficients in 'linear_model' by hand
lm_summ <- summary(linear_model)

c("lower" = lm_summ$coef[2, 1] - qt(0.975, df = lm_summ$df[2]) * lm_summ$coef[2, 2],
  "upper" = lm_summ$coef[2, 1] + qt(0.975, df = lm_summ$df[2]) * lm_summ$coef[2, 2])
#>     lower     upper
#> -3.222980 -1.336636


The upper and the lower bounds coincide. We have used the 0.975-quantile of
the $t_{418}$ distribution to get the exact result reported by confint().
Obviously, this interval does not contain the value zero which, as we have
already seen in the previous section, leads to the rejection of the null
hypothesis $\beta_{1,0} = 0$.
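
The by-hand calculation generalizes to all coefficients of a fitted model. The function `ci_by_hand` below is a hypothetical helper written for this sketch, and the built-in cars data serve as a stand-in so that the chunk runs without further packages.

```r
# illustrative sketch: confidence intervals by hand for any fitted lm object
ci_by_hand <- function(model, level = 0.95) {
  coefs <- summary(model)$coefficients
  q <- qt(1 - (1 - level) / 2, df = model$df.residual)
  cbind(lower = coefs[, "Estimate"] - q * coefs[, "Std. Error"],
        upper = coefs[, "Estimate"] + q * coefs[, "Std. Error"])
}

# built-in 'cars' data used as a stand-in for the test score data
mod <- lm(dist ~ speed, data = cars)
ci_by_hand(mod)

# agrees with confint() up to numerical precision
all.equal(unname(ci_by_hand(mod)), unname(confint(mod)))
```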


5.3 Regression when X is a Binary Variable 


Instead of using a continuous regressor X, we might be interested in running
the regression

$$Y_i = \beta_0 + \beta_1 D_i + u_i \tag{5.2}$$

where $D_i$ is a binary variable, a so-called dummy variable. For example, we
may define $D_i$ as follows:

$$D_i = \begin{cases} 1 & \text{if } STR \text{ in } i^{th} \text{ school district} < 20 \\ 0 & \text{if } STR \text{ in } i^{th} \text{ school district} \geq 20 \end{cases} \tag{5.3}$$

The regression model now is

$$TestScore_i = \beta_0 + \beta_1 D_i + u_i. \tag{5.4}$$

Let us see what these data look like in a scatter plot:
# Create the dummy variable as defined above
CASchools$D <- CASchools$STR < 20

# Plot the data
plot(CASchools$D, CASchools$score,   # provide the data to be plotted
     pch = 20,                       # use filled circles as plot symbols
     cex = 0.5,                      # set size of plot symbols to 0.5
     col = "Steelblue",              # set the symbols' color to "Steelblue"
     xlab = expression(D[i]),        # Set title and axis names
     ylab = "Test Score",
     main = "Dummy Regression")

[Figure: Dummy Regression. Test Score plotted against D_i.]


With D as the regressor, it is not useful to think of $\beta_1$ as a slope
parameter since $D_i \in \{0, 1\}$, i.e., we only observe two discrete values
instead of a continuum of regressor values. There is no continuous line
depicting the conditional expectation function $E(TestScore_i | D_i)$ since this
function is solely defined for x-positions 0 and 1.


Therefore, the interpretation of the coefficients in this regression model is as
follows:


• $E(Y_i | D_i = 0) = \beta_0$, so $\beta_0$ is the expected test score in
  districts where $D_i = 0$, i.e., where STR is at least 20.

• $E(Y_i | D_i = 1) = \beta_0 + \beta_1$ or, using the result above,
  $\beta_1 = E(Y_i | D_i = 1) - E(Y_i | D_i = 0)$. Thus, $\beta_1$ is the
  difference in group-specific expectations, i.e., the difference in expected
  test score between districts with $STR < 20$ and those with $STR \geq 20$.


We will now use R to estimate the dummy regression model as defined by the 
equations (5.2) and (5.3) . 


# estimate the dummy regression model
dummy_model <- lm(score ~ D, data = CASchools)
summary(dummy_model)
#> 
#> Call:
#> lm(formula = score ~ D, data = CASchools)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -50.496 -14.029  -0.346  12.884  49.504 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  650.077      1.393 466.666  < 2e-16 ***
#> DTRUE          7.169      1.847   3.882  0.00012 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 18.74 on 418 degrees of freedom
#> Multiple R-squared:  0.0348, Adjusted R-squared:  0.0325 
#> F-statistic: 15.07 on 1 and 418 DF,  p-value: 0.0001202





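summary() reports the p-value of the test that the coefficient on (Intercept)
is zero to be < 2e-16. This scientific notation states that the p-value is
smaller than $\frac{2}{10^{16}}$, so a very small number. The reason for this
is not that R cannot handle smaller numbers: the p-value of 6.57e-242 reported
for the intercept of linear_model in Section 5.1 is far smaller. Rather, the
print method of summary() does not display exact p-values below the machine
epsilon of about 2.2e-16 (see .Machine$double.eps), the smallest number that
changes the result when it is added to 1.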
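
The threshold behind the "< 2e-16" display can be inspected directly. This is a short sketch under the assumption of R's default settings; it uses only base R, and the object names are chosen for illustration.

```r
# illustrative sketch: the threshold behind "< 2e-16"
eps <- .Machine$double.eps   # machine epsilon, about 2.2e-16
eps

# R can store far smaller positive numbers than machine epsilon
tiny <- 1e-300
tiny > 0

# p-values below the threshold are formatted with a "<" sign
formatted <- format.pval(1e-20, eps = eps)
formatted
```

Hence the notation is a display convention for p-values that are too small to be reported meaningfully, not a limit on the numbers R can represent.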











The vector CASchools$D has the type logical (to see this, use
typeof(CASchools$D)) which is shown in the output of summary(dummy_model): the
label DTRUE states that all entries TRUE are coded as 1 and all entries FALSE
are coded as 0. Thus, the interpretation of the coefficient DTRUE is as stated
above for $\beta_1$.


One can see that the expected test score in districts with $STR < 20$
($D_i = 1$) is predicted to be 650.1 + 7.17 = 657.27 while districts with
$STR \geq 20$ ($D_i = 0$) are expected to have an average test score of only
650.1.


Group specific predictions can be added to the plot by execution of the following 
code chunk. 


# add group specific predictions to the plot 
points(x = CASchools$D,
       y = predict(dummy_model),
       col = "red",
       pch = 20)


Here we use the function predict() to obtain estimates of the group specific
means. The red dots represent these sample group averages. Accordingly,
$\hat\beta_1 = 7.17$ can be seen as the difference in group averages.
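
That the OLS estimates in a dummy regression coincide with group averages can be confirmed on simulated data. The data-generating process and all object names in the following sketch are hypothetical and only loosely mimic the test score example.

```r
# illustrative sketch with simulated data (all names and values are hypothetical)
set.seed(1)

D <- rep(c(FALSE, TRUE), each = 50)      # dummy regressor
y <- 650 + 7 * D + rnorm(100, sd = 18)   # assumed data-generating process

fit <- lm(y ~ D)

# group averages computed directly
group_means <- tapply(y, D, mean)

# intercept = mean of the D = 0 group, slope = difference in group means
c(intercept = unname(coef(fit)[1]),
  mean_D0   = unname(group_means["FALSE"]))
c(slope = unname(coef(fit)[2]),
  diff  = unname(group_means["TRUE"] - group_means["FALSE"]))
```

Within each printed pair the two numbers agree up to numerical precision.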


summary(dummy_model) also answers the question whether there is a statistically
significant difference in group means. This in turn would support the hypothesis
that students perform differently when they are taught in small classes. We can
assess this by a two-tailed test of the hypothesis $H_0: \beta_1 = 0$.
Conveniently, the t-statistic and the corresponding p-value for this test are
computed by summary().

Since t value = 3.88 > 1.96 we reject the null hypothesis at the 5% level of
significance. The same conclusion results when using the p-value of 0.00012,
which reports significance up to the 0.012% level.


As done with linear_model, we may alternatively use the function confint() 
to compute a 95% confidence interval for the true difference in means and see 
if the hypothesized value is an element of this confidence set. 


# confidence intervals for coefficients in the dummy regression model
confint(dummy_model)
#>                  2.5 %    97.5 %
#> (Intercept) 647.338594 652.81500
#> DTRUE         3.539562  10.79931


We reject the hypothesis that there is no difference between group means at the
5% significance level since $\beta_{1,0} = 0$ lies outside of $[3.54, 10.8]$,
the 95% confidence interval for the coefficient on D.


5.4 Heteroskedasticity and Homoskedasticity 


All inference made in the previous chapters relies on the assumption that the 
error variance does not vary as regressor values change. But this will often not 
be the case in empirical applications. 


Key Concept 5.4 


Heteroskedasticity and Homoskedasticity 


• The error term of our regression model is homoskedastic if the
  variance of the conditional distribution of $u_i$ given $X_i$,
  $Var(u_i | X_i = x)$, is constant for all observations in our sample:

$$Var(u_i | X_i = x) = \sigma^2 \ \forall \ i = 1, \dots, n.$$

• If instead there is dependence of the conditional variance of $u_i$ on
  $X_i$, the error term is said to be heteroskedastic. We then write

$$Var(u_i | X_i = x) = \sigma_i^2 \ \forall \ i = 1, \dots, n.$$

• Homoskedasticity is a special case of heteroskedasticity.





For a better understanding of heteroskedasticity, we generate some bivariate
heteroskedastic data, estimate a linear regression model and then use box plots
to depict the conditional distributions of the residuals.

# load scales package for adjusting color opacities
library(scales)


# generate some heteroskedastic data:

# set seed for reproducibility
set.seed(123)

# set up vector of x coordinates
x <- rep(c(10, 15, 20, 25), each = 25)

# initialize vector of errors
e <- c()

# sample 100 errors such that the variance increases with x
e[1:25] <- rnorm(25, sd = 10)
e[26:50] <- rnorm(25, sd = 15)
e[51:75] <- rnorm(25, sd = 20)
e[76:100] <- rnorm(25, sd = 25)

# set up y
y <- 720 - 3.3 * x + e

# Estimate the model
mod <- lm(y ~ x)

# Plot the data
plot(x = x,
     y = y,
     main = "An Example of Heteroskedasticity",
     xlab = "Student-Teacher Ratio",
     ylab = "Test Score",
     cex = 0.5,
     pch = 19,
     xlim = c(8, 27),
     ylim = c(600, 710))

# Add the regression line to the plot
abline(mod, col = "darkred")


# Add boxplots to the plot
boxplot(formula = y ~ x,
        add = TRUE,
        at = c(10, 15, 20, 25),
        col = alpha("gray", 0.4),
        border = "black")

[Figure: An Example of Heteroskedasticity.]

We have used the formula argument y ~ x in boxplot() to specify that we
want to split up the vector y into groups according to x. boxplot(y ~ x)
generates a boxplot for each of the groups in y defined by x.


For this artificial data it is clear that the conditional error variances differ. 
Specifically, we observe that the variance in test scores (and therefore the vari- 
ance of the errors committed) increases with the student teacher ratio. 
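
What the box plots show visually can be summarized in numbers by computing the spread of the residuals within each group. The sketch below regenerates comparable data so that it runs on its own; the intercept of 720 and the slope of -3.3 are assumptions of this sketch, and `res_sd` is a name chosen for illustration.

```r
# illustrative sketch: spread of the residuals within each group of x
set.seed(123)

x <- rep(c(10, 15, 20, 25), each = 25)
e <- rnorm(100, sd = rep(c(10, 15, 20, 25), each = 25))  # error sd grows with x
y <- 720 - 3.3 * x + e                                   # assumed coefficients

mod <- lm(y ~ x)

# standard deviation of the residuals for each value of x
res_sd <- tapply(residuals(mod), x, sd)
res_sd
```

The residual standard deviation for x = 25 should be markedly larger than for x = 10, which is what heteroskedasticity means in this setting.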


A Real-World Example for Heteroskedasticity 


Think about the economic value of education: if there were no expected eco- 
nomic value-added to receiving university education, you probably would not 
be reading this script right now. A starting point to empirically verify such a 
relation is to have data on working individuals. More precisely, we need data 
on wages and education of workers in order to estimate a model like 


$$wage_i = \beta_0 + \beta_1 \cdot education_i + u_i.$$


What can be presumed about this relation? It is likely that, on average, higher
educated workers earn more than workers with less education, so we expect
to estimate an upward sloping regression line. Also, it seems plausible that
earnings of better educated workers have a higher dispersion than those of
low-skilled workers: solid education is not a guarantee for a high salary so
even highly qualified workers take on low-income jobs. However, they are more
likely to meet the requirements for the well-paid jobs than workers with less
education for whom opportunities in the labor market are much more limited.

To verify this empirically we may use real data on hourly earnings and the
number of years of education of employees. Such data can be found in
CPSSWEducation. This data set is part of the package AER and comes from
the Current Population Survey (CPS) which is conducted periodically by the
Bureau of Labor Statistics in the United States.


The subsequent code chunks demonstrate how to import the data into R and 
how to produce a plot in the fashion of Figure 5.3 in the book. 


# load package and attach data
library(AER)
data("CPSSWEducation")
attach(CPSSWEducation)

# get an overview
summary(CPSSWEducation)
#>       age          gender        earnings        education    
#>  Min.   :29.0   female:1202   Min.   : 2.137   Min.   : 6.00  
#>  1st Qu.:29.0   male  :1748   1st Qu.:10.577   1st Qu.:12.00  
#>  Median :29.0                 Median :14.615   Median :13.00  
#>  Mean   :29.5                 Mean   :16.743   Mean   :13.55  
#>  3rd Qu.:30.0                 3rd Qu.:20.192   3rd Qu.:16.00  
#>  Max.   :30.0                 Max.   :97.500   Max.   :18.00


# estimate a simple regression model 
labor_model <- lm(earnings ~ education) 


# plot observations and add the regression line 
plot (education, 

earnings, 

ylim = c(0, 150)) 


abline(labor_model, 
col = "steelblue", 
lwd = 2) 




[Figure: scatter plot of earnings (vertical axis) against education (horizontal
axis) with the estimated regression line.]


The plot reveals that the mean of the distribution of earnings increases with the
level of education. This is also supported by a formal analysis: the estimated
regression model stored in labor_model shows that there is a positive relation
between years of education and earnings.


# print the contents of labor_model to the console
labor_model
#> 
#> Call:
#> lm(formula = earnings ~ education)
#> 
#> Coefficients:
#> (Intercept)    education  
#>      -3.134        1.467


The estimated regression equation states that, on average, an additional year 
of education increases a worker’s hourly earnings by about $1.47. Once more 
we use confint() to obtain a 95% confidence interval for both regression coef- 
ficients. 


# compute a 95% confidence interval for the coefficients in the model
confint(labor_model)
#>                 2.5 %    97.5 %
#> (Intercept) -5.015248 -1.253495
#> education    1.330098  1.603753


Since the interval is [1.33, 1.60] we can reject the hypothesis that the coefficient 
on education is zero at the 5% level. 


Furthermore, the plot indicates that there is heteroskedasticity: if we assume the 
regression line to be a reasonably good representation of the conditional mean 
function $E(earnings_i \vert education_i)$, the dispersion of hourly earnings around that 
function clearly increases with the level of education, i.e., the variance of the 
distribution of earnings increases. In other words: the variance of the errors 
(the errors made in explaining earnings by education) increases with education 
so that the regression errors are heteroskedastic. 
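
One informal way to see such a pattern in numbers is to compare the spread of the residuals across ranges of the regressor. The sketch below uses simulated data in which the error standard deviation is assumed to be proportional to x; all names and values are illustrative and not taken from CPSSWEducation.

```r
# hypothetical heteroskedastic data
set.seed(42)
x <- runif(2000, 6, 18)
u <- rnorm(2000, sd = 0.6 * x)   # assumption: error sd grows with x
y <- -3 + 1.5 * x + u
fit <- lm(y ~ x)

# split the sample at the median of x and compare residual standard deviations
high <- x > median(x)
sd_low  <- sd(residuals(fit)[!high])
sd_high <- sd(residuals(fit)[high])
c(low = sd_low, high = sd_high)
```

A clearly larger residual spread in the upper half of the regressor is what the scatterplot shows visually; this comparison is only a rough diagnostic, not a formal test.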


This example makes a case that the assumption of homoskedasticity is doubtful 
in economic applications. Should we care about heteroskedasticity? Yes, we 
should. As explained in the next section, heteroskedasticity can have serious 
negative consequences in hypothesis testing, if we ignore it. 


Should We Care About Heteroskedasticity? 


To answer the question whether we should worry about heteroskedasticity being 
present, consider the variance of $\hat\beta_1$ under the assumption of homoskedasticity. 
In this case we have 


$$ \sigma^2_{\hat\beta_1} = \frac{\sigma^2_u}{n \cdot \sigma^2_X}, \tag{5.5} $$


which is a simplified version of the general equation (4.1) presented in Key Con- 
cept 4.4. See Appendix 5.1 of the book for details on the derivation. summary() 
estimates (5.5) by 


$$ \overset{\sim}{\sigma}^2_{\hat\beta_1} = \frac{SER^2}{\sum_{i=1}^n (X_i - \overline{X})^2} \quad \text{where} \quad SER = \sqrt{\frac{1}{n-2} \sum_{i=1}^n \hat u_i^2}. $$


Thus summary() estimates the homoskedasticity-only standard error 


$$ \sqrt{\overset{\sim}{\sigma}^2_{\hat\beta_1}} = \sqrt{\frac{SER^2}{\sum_{i=1}^n (X_i - \overline{X})^2}}. $$


This is in fact an estimator for the standard deviation of the estimator $\hat\beta_1$ that 
is inconsistent for the true value $\sigma_{\hat\beta_1}$ when there is heteroskedasticity. The 
implication is that t-statistics computed in the manner of Key Concept 5.1 do 
not follow a standard normal distribution, even in large samples. This issue 
may invalidate inference when using the previously treated tools for hypothesis 
testing: we should be cautious when making statements about the significance 
of regression coefficients on the basis of t-statistics as computed by summary() or 
confidence intervals produced by confint() if it is doubtful that the assumption 
of homoskedasticity holds! 


We will now use R to compute the homoskedasticity-only standard error for $\hat\beta_1$ 
in the earnings regression model labor_model by hand and see that it matches 
the value produced by summary(). 


# Store model summary in 'model' 
model <- summary(labor_model) 

# Extract the standard error of the regression from model summary 
SER <- model$sigma 

# Compute the variation in 'education' 
V <- (nrow(CPSSWEducation) - 1) * var(education) 

# Compute the standard error of the slope parameter's estimator and print it 
SE.beta_1.hat <- sqrt(SER^2 / V) 
SE.beta_1.hat 
#> [1] 0.06978281 

# Use logical operators to see if the value computed by hand matches the one provided 
# in model$coefficients. Round estimates to four decimal places 
round(model$coefficients[2, 2], 4) == round(SE.beta_1.hat, 4) 
#> [1] TRUE 


Indeed, the estimated values are equal. 


Computation of Heteroskedasticity-Robust Standard Errors 


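Consistent estimation of $\sigma_{\hat\beta_1}$ under heteroskedasticity is ensured when the fol- 
lowing robust estimator is used. 


$$ SE(\hat\beta_1) = \sqrt{ \frac{1}{n} \cdot \frac{ \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \hat u_i^2 }{ \left[ \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \right]^2 } } \tag{5.6} $$


Standard error estimates computed this way are also referred to as Eicker-Huber- 
White standard errors; the most frequently cited paper on this is White (1980). 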
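
For the simple regression model, the robust estimator can be evaluated directly. The sketch below is an illustration on simulated data (the data-generating process and all object names are assumptions, not part of the book's examples). It computes the robust standard error of the slope from the scalar formula and, as a cross-check, from the matrix "sandwich" form $(X'X)^{-1} X' \text{diag}(\hat u_i^2) X (X'X)^{-1}$.

```r
# simulated heteroskedastic data (illustrative assumption)
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 1 * x + rnorm(n, sd = 0.6 * x)
fit <- lm(y ~ x)
u_hat <- residuals(fit)

# scalar formula for the robust standard error of the slope estimator
num <- mean((x - mean(x))^2 * u_hat^2)
den <- mean((x - mean(x))^2)^2
se_robust <- sqrt(num / den / n)

# the same quantity from the matrix sandwich form
X <- model.matrix(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * u_hat)          # X' diag(u_hat^2) X
se_sandwich <- unname(sqrt(diag(bread %*% meat %*% bread))[2])

c(formula = se_robust, sandwich = se_sandwich)
```

Both routes give the same number up to floating point error, which is why matrix-based functions can be used in place of the scalar formula.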


It can be quite cumbersome to do this calculation by hand. Luckily, there are 
R functions that serve this purpose. A convenient one named vcovHC() is 
part of the package sandwich (a dependency of the package AER, so it is attached 
automatically if you load AER). This function can compute a variety of standard 
errors. The one brought forward in (5.6) is computed when the argument type 
is set to "HC0". Most of the examples presented in the book rely on a slightly 
different formula which is the default in the statistics package STATA: 


$$ SE(\hat\beta_1)_{HC1} = \sqrt{ \frac{1}{n} \cdot \frac{ \frac{1}{n-2} \sum_{i=1}^n (X_i - \overline{X})^2 \hat u_i^2 }{ \left[ \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \right]^2 } } \tag{5.2} $$


The difference is that we multiply by $\frac{1}{n-2}$ in the numerator of (5.2). This is 
a degrees of freedom correction and was considered by MacKinnon and White 
(1985). To get vcovHC() to use (5.2), we have to set type = "HC1". 
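
Because (5.6) and (5.2) differ only in this factor, the two estimates are linked by a simple rescaling. Writing $SE(\hat\beta_1)_{HC0}$ for the estimator in (5.6), a short derivation gives:

```latex
SE(\hat\beta_1)_{HC1}^2
  = \frac{1}{n} \cdot
    \frac{\frac{1}{n-2}\sum_{i=1}^n (X_i-\overline{X})^2 \hat u_i^2}
         {\left[\frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2\right]^2}
  = \frac{n}{n-2} \cdot
    \frac{1}{n} \cdot
    \frac{\frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2 \hat u_i^2}
         {\left[\frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2\right]^2}
  = \frac{n}{n-2} \cdot SE(\hat\beta_1)_{HC0}^2 .
```

The HC1 standard error is therefore slightly larger than the HC0 one by the factor $\sqrt{n/(n-2)}$, and the difference vanishes as $n$ grows.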


Let us now compute robust standard error estimates for the coefficients in 
linear_model. 


# compute heteroskedasticity-robust standard errors 
vcov <- vcovHC(linear_model, type = "HC1") 
vcov 
#>             (Intercept)        STR
#> (Intercept)  107.419993 -5.3639114
#> STR           -5.363911  0.2698692


The output of vcovHC() is the variance-covariance matrix of coefficient esti- 
mates. We are interested in the square root of the diagonal elements of this 
matrix, i.e., the standard error estimates. 





When we have k > 1 regressors, writing down the equations for a regres- 
sion model becomes very messy. A more convenient way to denote and 
estimate so-called multiple regression models (see Chapter 6) is by using 
matrix algebra. This is why functions like vcovHC() produce matrices. 
In the simple linear regression model, the variances and covariances of 
the estimators can be gathered in the symmetric variance-covariance 
matrix 


$$ \text{Var} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} \text{Var}(\hat\beta_0) & \text{Cov}(\hat\beta_0, \hat\beta_1) \\ \text{Cov}(\hat\beta_0, \hat\beta_1) & \text{Var}(\hat\beta_1) \end{pmatrix}, $$


so vcovHC() gives us $\widehat{\text{Var}}(\hat\beta_0)$, $\widehat{\text{Var}}(\hat\beta_1)$ and $\widehat{\text{Cov}}(\hat\beta_0, \hat\beta_1)$, but most of the 
time we are interested in the diagonal elements of the estimated matrix. 











# compute the square root of the diagonal elements in vcov 
robust_se <- sqrt(diag(vcov)) 
robust_se 
#> (Intercept)         STR 
#>  10.3643617   0.5194893 


Now assume we want to generate a coefficient summary as provided by 
summary() but with robust standard errors of the coefficient estimators, robust 
t-statistics and corresponding p-values for the regression model linear_model. 
This can be done using coeftest() from the package lmtest, see ?coeftest. 
Further we specify in the argument vcov. that vcov, the Eicker-Huber-White 
estimate of the variance matrix we have computed before, should be used. 


# we invoke the function `coeftest()` on our model 
coeftest(linear_model, vcov. = vcov) 
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) 698.93295   10.36436 67.4362 < 2.2e-16 ***
#> STR          -2.27981    0.51949 -4.3886 1.447e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We see that the values reported in the column Std. Error are equal to those 
from sqrt(diag(vcov)). 
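
To make explicit what such a robust coefficient table contains, the following sketch builds one by hand in base R. It runs on simulated data (an assumption for illustration; fit, V_hc1 and the other names are hypothetical) and assumes, in line with the "t test of coefficients" header above, that p-values are taken from the t-distribution with the residual degrees of freedom.

```r
# simulated heteroskedastic data (illustrative assumption)
set.seed(7)
n <- 400
x <- runif(n, 1, 10)
y <- 5 - 2 * x + rnorm(n, sd = 0.8 * x)
fit <- lm(y ~ x)

# HC1 covariance matrix by hand: sandwich form times n / (n - 2)
X <- model.matrix(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * residuals(fit))
V_hc1 <- n / df.residual(fit) * bread %*% meat %*% bread

# robust standard errors, t-statistics and two-sided p-values
se_rob <- sqrt(diag(V_hc1))
t_rob  <- coef(fit) / se_rob
p_rob  <- 2 * pt(abs(t_rob), df = df.residual(fit), lower.tail = FALSE)

cbind(Estimate = coef(fit), "Std. Error" = se_rob,
      "t value" = t_rob, "Pr(>|t|)" = p_rob)
```

The coefficient estimates are unchanged by the choice of covariance estimator; only the standard errors, and with them the t-statistics and p-values, differ from the output of summary().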


How severe are the implications of using homoskedasticity-only standard errors 
in the presence of heteroskedasticity? The answer is: it depends. As men- 
tioned above we face the risk of drawing wrong conclusions when conducting 
significance tests. Let us illustrate this by generating another example of a het- 
eroskedastic data set and using it to estimate a simple regression model. We 
take 


$$ Y_i = \beta_1 \cdot X_i + u_i \ \ , \ \ u_i \overset{i.i.d.}{\sim} \mathcal{N}(0, 0.36 \cdot X_i^2) $$


with $\beta_1 = 1$ as the data generating process. Clearly, the assumption of ho- 
moskedasticity is violated here since the variance of the errors is a nonlinear, 
increasing function of $X_i$ but the errors have zero mean and are i.i.d. such that 
the assumptions made in Key Concept 4.3 are not violated. As before, we are 
interested in estimating $\beta_1$. 


set.seed(905) 

# generate heteroskedastic data 
X <- 1:500 
Y <- rnorm(n = 500, mean = X, sd = 0.6 * X) 

# estimate a simple regression model 
reg <- lm(Y ~ X) 


We plot the data and add the regression line. 


# plot the data 
plot(x = X, y = Y, 
     pch = 19, 
     col = "steelblue", 
     cex = 0.8) 

# add the regression line to the plot 
abline(reg, 
       col = "darkred", 
       lwd = 1.5) 














The plot shows that the data are heteroskedastic as the variance of Y grows with 
X. We next conduct a significance test of the (true) null hypothesis $H_0: \beta_1 = 1$ 
twice, once using the homoskedasticity-only standard error formula and once 
with the robust version (5.6). An easy way to do this in R is the function 
linearHypothesis() from the package car, see ?linearHypothesis. It allows 
us to test linear hypotheses about parameters in linear models in a similar way as 
done with a t-statistic and offers various robust covariance matrix estimators. 
We test by comparing the tests' p-values to the significance level of 5%. 


linearHypothesis() computes a test statistic that follows an F- 
distribution under the null hypothesis. We will not focus on the de- 
tails of the underlying theory. In general, the idea of the F-test is to 
compare the fit of different models. When testing a hypothesis about a 
single coefficient using an F-test, one can show that the test statistic is 
simply the square of the corresponding t-statistic: 


$$ F = t^2 = \left( \frac{\hat\beta_i - \beta_{i,0}}{SE(\hat\beta_i)} \right)^2 \sim F_{1,n-k-1} $$


In linearHypothesis(), there are different ways to specify the hypothesis 
to be tested, e.g., using a vector of the type character (as done in the 
next code chunk), see ?linearHypothesis for alternatives. The function 
returns an object of class anova which contains further information on 
the test that can be accessed using the $ operator. 
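
The relation between the F-statistic and the squared t-statistic can be checked with base R alone. The sketch below uses simulated homoskedastic data (an assumption for illustration; the object names are hypothetical) and the null hypothesis that the slope is zero, so that the restricted model is a regression on a constant only.

```r
# hypothetical data
set.seed(11)
x <- rnorm(100)
y <- 1 + 0.3 * x + rnorm(100)

unrestricted <- lm(y ~ x)
restricted   <- lm(y ~ 1)      # imposes the null hypothesis beta_1 = 0

# t-statistic for H0: beta_1 = 0 from the coefficient table
t_stat <- coef(summary(unrestricted))["x", "t value"]

# F-statistic from comparing the fit of both models
F_stat <- anova(restricted, unrestricted)$F[2]

c(t_squared = t_stat^2, F = F_stat)
```

Both entries agree, which illustrates why a test on a single coefficient can be reported either as a t-test or as an F-test with one numerator degree of freedom.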











# test hypothesis using the default standard error formula 
linearHypothesis(reg, hypothesis.matrix = "X = 1")$'Pr(>F)'[2] < 0.05 
#> [1] TRUE 


# test hypothesis using the robust standard error formula 
linearHypothesis(reg, hypothesis.matrix = "X = 1", white.adjust = "hc1")$'Pr(>F)'[2] < 0.05 
#> [1] FALSE 


This is a good example of what can go wrong if we ignore heteroskedasticity: 
for the data set at hand the default method rejects the null hypothesis $\beta_1 = 1$ 
although it is true. When using the robust standard error formula the test does 
not reject the null. Of course, we could think this might just be a coincidence 
and both tests do equally well in maintaining the type I error rate of 5%. This 
can be further investigated by computing Monte Carlo estimates of the rejection 
frequencies of both tests on the basis of a large number of random samples. We 
proceed as follows: 


• initialize vectors t and t.rob. 

• Using a for() loop, we generate 10000 heteroskedastic random samples 
of size 1000, estimate the regression model and check whether the tests 
falsely reject the null at the level of 5% using comparison operators. The 
results are stored in the respective vectors t and t.rob. 

• After the simulation, we compute the fraction of false rejections for both 
tests. 




# initialize vectors t and t.rob 
t <- c() 
t.rob <- c() 

# loop sampling and estimation 
for (i in 1:10000) { 

  # sample data 
  X <- 1:1000 
  Y <- rnorm(n = 1000, mean = X, sd = 0.6 * X) 

  # estimate regression model 
  reg <- lm(Y ~ X) 

  # homoskedasticity-only significance test 
  t[i] <- linearHypothesis(reg, "X = 1")$'Pr(>F)'[2] < 0.05 

  # robust significance test 
  t.rob[i] <- linearHypothesis(reg, "X = 1", white.adjust = "hc1")$'Pr(>F)'[2] < 0.05 

} 

# compute the fraction of false rejections 
round(cbind(t = mean(t), t.rob = mean(t.rob)), 3) 
#>          t t.rob
#> [1,] 0.073  0.05


These results reveal the increased risk of falsely rejecting the null using the 
homoskedasticity-only standard error for the testing problem at hand: with the 
common standard error, 7.28% of all tests falsely reject the null hypothesis. In 
contrast, with the robust test statistic we are closer to the nominal level of 5%. 
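
The same comparison can be sketched without the car package by computing both t-tests by hand. The version below is a smaller variant under stated assumptions: 1000 repetitions with samples of size 200 (chosen only to keep the run time short), the HC1 formula (5.2) for the robust standard error, and critical values from the t-distribution with n - 2 degrees of freedom. The names reject_homo and reject_rob are hypothetical.

```r
set.seed(1)
reps <- 1000
n <- 200
X <- 1:n
reject_homo <- logical(reps)
reject_rob  <- logical(reps)

for (i in 1:reps) {
  # heteroskedastic sample with true slope 1
  Y <- rnorm(n, mean = X, sd = 0.6 * X)
  reg <- lm(Y ~ X)
  b1 <- unname(coef(reg)[2])
  u  <- residuals(reg)

  # homoskedasticity-only standard error of the slope
  se_homo <- unname(sqrt(diag(vcov(reg)))[2])

  # HC1 standard error of the slope, formula (5.2)
  se_rob <- sqrt(1 / n * (sum((X - mean(X))^2 * u^2) / (n - 2)) /
                   mean((X - mean(X))^2)^2)

  # two-sided t-tests of H0: beta_1 = 1 at the 5% level
  crit <- qt(0.975, df = n - 2)
  reject_homo[i] <- abs((b1 - 1) / se_homo) > crit
  reject_rob[i]  <- abs((b1 - 1) / se_rob)  > crit
}

c(homoskedasticity_only = mean(reject_homo), robust = mean(reject_rob))
```

Since the null hypothesis is true in every sample, each rejection is a type I error, so the two means estimate the actual size of the respective test.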


5.5 The Gauss-Markov Theorem 


When estimating regression models, we know that the results of the estima- 
tion procedure are random. However, when using unbiased estimators, at least 
on average, we estimate the true parameter. When comparing different unbi- 
ased estimators, it is therefore interesting to know which one has the highest 
precision: being aware that the likelihood of estimating the exact value of the 
parameter of interest is 0 in an empirical application, we want to make sure 
that the likelihood of obtaining an estimate very close to the true value is as 
high as possible. This means we want to use the estimator with the lowest 
variance of all unbiased estimators, provided we care about unbiasedness. The 
Gauss-Markov theorem states that, in the class of conditionally unbiased linear 
estimators, the OLS estimator has this property under certain conditions. 


Key Concept 5.5 
The Gauss-Markov Theorem for $\hat\beta_1$ 


Suppose that the assumptions made in Key Concept 4.3 hold and 
that the errors are homoskedastic. The OLS estimator is the best (in 
the sense of smallest variance) linear conditionally unbiased estimator 
(BLUE) in this setting. 


Let us have a closer look at what this means: 


• Estimators of $\beta_1$ that are linear functions of the $Y_1, \dots, Y_n$ and 
that are unbiased conditionally on the regressor $X_1, \dots, X_n$ can be 
written as 

$$ \overset{\sim}{\beta}_1 = \sum_{i=1}^n a_i Y_i $$

where the $a_i$ are weights that are allowed to depend on the $X_i$ but 
not on the $Y_i$. 


• We already know that $\overset{\sim}{\beta}_1$ has a sampling distribution: $\overset{\sim}{\beta}_1$ is a linear 
function of the $Y_i$ which are random variables. If now 

$$ E(\overset{\sim}{\beta}_1 \vert X_1, \dots, X_n) = \beta_1, $$

$\overset{\sim}{\beta}_1$ is a linear unbiased estimator of $\beta_1$, conditionally on the 
$X_1, \dots, X_n$. 


• We may ask if $\overset{\sim}{\beta}_1$ is also the best estimator in this class, i.e., the 
most efficient one of all linear conditionally unbiased estimators 
where most efficient means smallest variance. The weights $a_i$ play 
an important role here and it turns out that OLS uses just the 
right weights to have the BLUE property. 





Simulation Study: BLUE Estimator 


Consider the case of a regression of $Y_1, \dots, Y_n$ only on a constant. Here, the 
$Y_i$ are assumed to be a random sample from a population with mean $\mu$ and 
variance $\sigma^2$. The OLS estimator in this model is simply the sample mean, see 
Chapter 3.2. 
Clearly, each observation is weighted by 


$$ a_i = \frac{1}{n}, $$


and we also know that $\text{Var}(\hat\beta_1) = \frac{\sigma^2}{n}$. 


We now use R to conduct a simulation study that demonstrates what happens 
to the variance of (5.4) if different weights 


$$ w_i = \frac{1 \pm \epsilon}{n} $$


are assigned to either half of the sample $Y_1, \dots, Y_n$ instead of using $\frac{1}{n}$, the OLS 
weights. 
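
For weights of this form the effect on the variance can be worked out directly. The following short derivation assumes, as above, independent $Y_i$ with common variance $\sigma^2$, with weight $(1+\epsilon)/n$ on one half of the sample and $(1-\epsilon)/n$ on the other half:

```latex
\sum_{i=1}^n w_i = \frac{n}{2}\cdot\frac{1+\epsilon}{n} + \frac{n}{2}\cdot\frac{1-\epsilon}{n} = 1,
\qquad
\text{Var}\left(\sum_{i=1}^n w_i Y_i\right)
  = \sigma^2 \sum_{i=1}^n w_i^2
  = \sigma^2 \cdot \frac{n}{2} \cdot \frac{(1+\epsilon)^2 + (1-\epsilon)^2}{n^2}
  = \frac{\sigma^2}{n}\,(1+\epsilon^2).
```

The weights still sum to one, so the weighted estimator remains unbiased, but its variance exceeds $\sigma^2/n$ for every $\epsilon \neq 0$. With $\epsilon = 0.8$ the variance is $1.64$ times that of the sample mean.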


# set sample size and number of repetitions 
n <- 100 
reps <- 1e5 


# choose epsilon and create a vector of weights as defined above 
epsilon <- 0.8 
w <- c(rep((1 + epsilon) / n, n / 2), 

rep((1 - epsilon) / n, n / 2) ) 


# draw a random sample y_1,...,y_n from the standard normal distribution, 
# use both estimators 1e5 times and store the result in the vectors 'ols' 
# and 'weightedestimator' 

ols <- rep(NA, reps) 
weightedestimator <- rep(NA, reps) 

for (i in 1:reps) { 

  y <- rnorm(n) 
  ols[i] <- mean(y) 
  weightedestimator[i] <- crossprod(w, y) 

} 

# plot kernel density estimates of the estimators' distributions: 

# OLS 
plot(density(ols), 
     col = "purple", 
     lwd = 3, 
     main = "Density of OLS and Weighted Estimator", 
     xlab = "Estimates") 

# weighted 
lines(density(weightedestimator), 
      col = "steelblue", 
      lwd = 3) 

# add a dashed line at 0 and add a legend to the plot 
abline(v = 0, lty = 2) 

legend('topright', 
       c("OLS", "Weighted"), 
       col = c("purple", "steelblue"), 
       lwd = 3) 


What conclusion can we draw from the result? 


• Both estimators seem to be unbiased: the means of their estimated distri- 
butions are zero. 

• The estimator using weights that deviate from those implied by OLS is less 
efficient than the OLS estimator: there is higher dispersion when weights 
are $w_i = \frac{1 \pm 0.8}{100}$ instead of $w_i = \frac{1}{100}$ as required by the OLS solution. 
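
The two observations above can also be checked numerically. The sketch below is an illustration under the assumptions of the simulation (standard normal data, n = 100, epsilon = 0.8) but with only 10000 repetitions; the object names are hypothetical. It compares the theoretical variances, which for unit error variance equal the sum of squared weights, with simulated ones.

```r
n <- 100
epsilon <- 0.8
w_ols <- rep(1 / n, n)
w_alt <- c(rep((1 + epsilon) / n, n / 2),
           rep((1 - epsilon) / n, n / 2))

# both sets of weights sum to one, so both estimators are unbiased for the mean
c(sum(w_ols), sum(w_alt))

# theoretical variances for sigma^2 = 1: the sum of squared weights
var_ols <- sum(w_ols^2)
var_alt <- sum(w_alt^2)

# simulated variances: each row of 'draws' is one sample of size n
set.seed(1)
draws <- matrix(rnorm(n * 10000), ncol = n)
sim_ols <- var(drop(draws %*% w_ols))
sim_alt <- var(drop(draws %*% w_alt))

rbind(theory     = c(ols = var_ols, weighted = var_alt),
      simulation = c(ols = sim_ols, weighted = sim_alt))
```

The theoretical values are 0.01 for the OLS weights and 0.0164 for the alternative weights, and the simulated variances should lie close to them.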








Hence, the simulation results support the Gauss-Markov Theorem. 


5.6 Using the t-Statistic in Regression When 
the Sample Size Is Small 


The three OLS assumptions discussed in Chapter 4 (see Key Concept 4.3) are the 
foundation for the results on the large sample distribution of the OLS estimators 
in the simple regression model. What can be said about the distribution of the 
estimators and their t-statistics when the sample size is small and the population 
distribution of the data is unknown? Provided that the three least squares 
assumptions hold and the errors are normally distributed and homoskedastic (we 
refer to these conditions as the homoskedastic normal regression assumptions), 
we have normally distributed estimators and t-distributed test statistics in small 
samples. 


Recall the definition of a t-distributed variable 


$$ \frac{Z}{\sqrt{W/M}} \sim t_M $$


where $Z$ is a standard normal random variable, $W$ is $\chi^2$ distributed with $M$ 
degrees of freedom and $Z$ and $W$ are independent. See section 5.6 in the book 
for a more detailed discussion of the small sample distribution of t-statistics in 
regression methods. 
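
This definition can be illustrated by simulation. The sketch below is a minimal example (the number of draws and the choice M = 18 are assumptions made for illustration): it constructs the ratio from independent normal and chi-squared draws and compares a few of its quantiles with those of the t-distribution returned by qt().

```r
set.seed(3)
M <- 18

# independent standard normal and chi-squared draws
Z <- rnorm(1e5)
W <- rchisq(1e5, df = M)

# the ratio from the definition of a t-distributed variable
t_draws <- Z / sqrt(W / M)

# compare simulated and theoretical quantiles
probs <- c(0.05, 0.5, 0.95)
rbind(simulated   = quantile(t_draws, probs),
      theoretical = qt(probs, df = M))
```

The simulated quantiles should be close to the theoretical ones, up to simulation noise.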


Let us simulate the distribution of regression t-statistics based on a large number 
of small random samples, say $n = 20$, and compare the simulated distributions 
to the theoretical distributions which should be $t_{18}$, the t-distribution with 18 
degrees of freedom (recall that $DF = n - k - 1$). 


# initialize two vectors 
beta_0 <- c() 
beta_1 <- c() 


# loop sampling / estimation / t statistics 
for (i in 1:10000) { 

  X <- runif(20, 0, 20) 
  Y <- rnorm(n = 20, mean = X) 
  reg <- summary(lm(Y ~ X)) 
  beta_0[i] <- (reg$coefficients[1, 1] - 0)/(reg$coefficients[1, 2]) 
  beta_1[i] <- (reg$coefficients[2, 1] - 1)/(reg$coefficients[2, 2]) 

} 


# plot the distributions and compare with t_18 density: 

# divide plotting area 
par(mfrow = c(1, 2)) 

# plot the simulated density of beta_0 
plot(density(beta_0), 
     lwd = 2, 
     main = expression(widehat(beta)[0]), 
     xlim = c(-4, 4)) 

# add the t_18 density to the plot 
curve(dt(x, df = 18), 
      add = TRUE, 
      col = "red", 
      lwd = 2, 
      lty = 2) 

# plot the simulated density of beta_1 
plot(density(beta_1), 
     lwd = 2, 
     main = expression(widehat(beta)[1]), 
     xlim = c(-4, 4)) 

# add the t_18 density to the plot 
curve(dt(x, df = 18), 
      add = TRUE, 
      col = "red", 
      lwd = 2, 
      lty = 2) 


The outcomes are consistent with our expectations: the empirical distributions 
of the t-statistics of both estimators seem to track the theoretical $t_{18}$ distribution quite closely. 


5.7 Exercises 


This interactive part of the book is only available in the HTML version. 


Chapter 6 


Regression Models with 
Multiple Regressors 


In what follows we introduce linear regression models that use more than just one 
explanatory variable and discuss important key concepts in multiple regression. 
As we broaden our scope beyond the relationship of only two variables (the 
dependent variable and a single regressor) some potential new issues arise such 
as multicollinearity and omitted variable bias (OVB). In particular, this chapter 
deals with omitted variables and their implications for the causal interpretation of 
OLS-estimated coefficients. 


Naturally, we will discuss estimation of multiple regression models using R. We 
will also illustrate the importance of thoughtful usage of multiple regression 
models via simulation studies that demonstrate the consequences of using highly 
correlated regressors or misspecified models. 


The packages AER (Kleiber and Zeileis, 2020) and MASS (Ripley, 2020) are needed 
for reproducing the code presented in this chapter. Make sure that the following 
code chunk executes without any errors. 


library(AER) 
library(MASS) 


6.1 Omitted Variable Bias 


The previous analysis of the relationship between test score and class size dis- 
cussed in Chapters 4 and 5 has a major flaw: we ignored other determinants 
of the dependent variable (test score) that correlate with the regressor (class 
size). Remember that influences on the dependent variable which are not cap- 
tured by the model are collected in the error term, which we so far assumed to 
be uncorrelated with the regressor. However, this assumption is violated if we 
exclude determinants of the dependent variable which vary with the regressor. 
This might induce an estimation bias, i.e., the mean of the OLS estimator's 
sampling distribution no longer equals the true parameter value. In our example we 
therefore wrongly estimate the causal effect on test scores of a unit change in 
the student-teacher ratio, on average. This issue is called omitted variable bias 
(OVB) and is summarized by Key Concept 6.1. 


Key Concept 6.1 
Omitted Variable Bias in Regression with a Single Regressor 


Omitted variable bias is the bias in the OLS estimator that arises when 
the regressor, X, is correlated with an omitted variable. For omitted 
variable bias to occur, two conditions must be fulfilled: 


1. X is correlated with the omitted variable. 


2. The omitted variable is a determinant of the dependent variable 
$Y$. 


Together, 1. and 2. result in a violation of the first OLS assumption 
$E(u_i \vert X_i) = 0$. Formally, the resulting bias can be expressed as 


$$ \hat\beta_1 \xrightarrow[]{p} \beta_1 + \rho_{Xu} \frac{\sigma_u}{\sigma_X}. \tag{6.1} $$


See Appendix 6.1 of the book for a detailed derivation. (6.1) states that 
OVB is a problem that cannot be solved by increasing the number of 
observations used to estimate $\beta_1$, as $\hat\beta_1$ is inconsistent: OVB prevents 
the estimator from converging in probability to the true parameter value. 
Strength and direction of the bias are determined by $\rho_{Xu}$, the correlation 
between the error term and the regressor. 
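
The bias formula can be illustrated with a small simulation. The sketch below rests on an assumed data generating process that is not part of the book: the omitted variable w is correlated with x and has coefficient 3, while the true coefficient on x is 2. In this design the population value of $\rho_{Xu} \sigma_u / \sigma_X$ is 1.5, so the slope of the short regression should be close to 3.5 rather than 2.

```r
set.seed(5)
n <- 10000
x <- rnorm(n)
w <- 0.5 * x + rnorm(n)          # omitted variable, correlated with x
e <- rnorm(n)
y <- 1 + 2 * x + 3 * w + e       # true coefficient on x is 2

# error term of the short regression that omits w
u <- 3 * w + e

# slope of the short regression of y on x only
b1_short <- unname(coef(lm(y ~ x))[2])

# sample analogue of the right-hand side of (6.1)
rhs <- 2 + cor(x, u) * sd(u) / sd(x)

c(short_regression = b1_short, formula = rhs)
```

The two numbers coincide in the sample, and increasing n does not move them toward the true value 2, which is the inconsistency described in Key Concept 6.1.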





In the example of test score and class size, it is easy to come up with variables 
that may cause such a bias, if omitted from the model. As mentioned in the 
book, a highly relevant variable could be the percentage of English learners in the 
school district: it is plausible that the ability to speak, read and write English 
is an important factor for successful learning. Therefore, students that are still 
learning English are likely to perform worse in tests than native speakers. Also, 
it is conceivable that the share of English learning students is bigger in school 
districts where class sizes are relatively large: think of poor urban districts 
where a lot of immigrants live. 


Let us think about a possible bias induced by omitting the share of English 
learning students (PctEL) in view of (6.1). When the estimated regression 
model does not include PctEL as a regressor although the true data generating 
process (DGP) is 


$$ TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times PctEL \tag{6.2} $$


where STR and PctEL are correlated, we have 


$$ \rho_{STR,PctEL} \neq 0. $$


Let us investigate this using R. After defining our variables we may compute the 
correlation between STR and PctEL as well as the correlation between STR 
and TestScore. 


# load the AER package 
library(AER) 

# load the data set 
data(CASchools) 

# define variables 
CASchools$STR <- CASchools$students/CASchools$teachers 
CASchools$score <- (CASchools$read + CASchools$math)/2 

# compute correlations 
cor(CASchools$STR, CASchools$score) 
#> [1] -0.2263627 
cor(CASchools$STR, CASchools$english) 
#> [1] 0.1876424 


The fact that $\hat\rho_{STR,TestScore} = -0.2264$ is cause for concern that omitting 
PctEL leads to a negatively biased estimate $\hat\beta_1$ since this indicates that $\rho_{Xu} < 0$. 
As a consequence we expect $\hat\beta_1$, the coefficient on STR, to be too large in ab- 
solute value. Put differently, the OLS estimate of $\hat\beta_1$ suggests that small classes 
improve test scores, but that the effect of small classes is overestimated as it 
captures the effect of having fewer English learners, too. 


What happens to the magnitude of $\hat\beta_1$ if we add the variable PctEL to the 
regression, that is, if we estimate the model 


$$ TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times PctEL + u $$


instead? And what do we expect about the sign of $\hat\beta_2$, the estimated coefficient 
on PctEL? Following the reasoning above we should still end up with a negative 
but larger coefficient estimate $\hat\beta_1$ than before and a negative estimate $\hat\beta_2$. 


Let us estimate both regression models and compare. Performing a multiple 
regression in R is straightforward. One can simply add additional variables to 
the right hand side of the formula argument of the function lm() by using their 
names and the + operator. 


# estimate both regression models 
mod <- lm(score ~ STR, data = CASchools) 
mult.mod <- lm(score ~ STR + english, data = CASchools) 


# print the results to the console 
mod 
#> 
#> Call:
#> lm(formula = score ~ STR, data = CASchools)
#> 
#> Coefficients:
#> (Intercept)          STR  
#>      698.93        -2.28

mult.mod 
#> 
#> Call:
#> lm(formula = score ~ STR + english, data = CASchools)
#> 
#> Coefficients:
#> (Intercept)          STR      english  
#>    686.0322      -1.1013      -0.6498


We find the outcomes to be consistent with our expectations. 


The following section discusses some theory on multiple regression models. 


6.2 The Multiple Regression Model 


The multiple regression model extends the basic concept of the simple regression 
model discussed in Chapters 4 and 5. A multiple regression model enables us to 
estimate the effect on $Y_i$ of changing a regressor $X_{1i}$ if the remaining regressors 
$X_{2i}, X_{3i}, \dots, X_{ki}$ do not vary. In fact we already have performed estimation 
of the multiple regression model (6.2) using R in the previous section. The 
interpretation of the coefficient on student-teacher ratio is the effect on test 
scores of a one unit change in the student-teacher ratio if the percentage of English 
learners is kept constant. 


Just like in the simple regression model, we assume the true relationship between 
$Y$ and $X_{1i}, X_{2i}, \dots, X_{ki}$ to be linear. On average, this relation is given by 
the population regression function 


$$ E(Y_i \vert X_{1i} = x_1, X_{2i} = x_2, X_{3i} = x_3, \dots, X_{ki} = x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k. \tag{6.3} $$


As in the simple regression model, the relation 

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} $$

does not hold exactly since there are disturbing influences to the dependent 
variable $Y$ we cannot observe as explanatory variables. Therefore we add an 
error term $u$ which represents deviations of the observations from the population 
regression line to (6.3). This yields the population multiple regression model 


$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n. \tag{6.4} $$








Key Concept 6.2 summarizes the core concepts of the multiple regression model. 


Key Concept 6.2 
The Multiple Regression Model 


• $Y_i$ is the $i^{th}$ observation in the dependent variable. Observations 
on the $k$ regressors are denoted by $X_{1i}, X_{2i}, \dots, X_{ki}$ and $u_i$ is the 
error term. 

• The average relationship between $Y$ and the regressors is given by 
the population regression line 

$$ E(Y_i \vert X_{1i} = x_1, X_{2i} = x_2, X_{3i} = x_3, \dots, X_{ki} = x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k. $$

• $\beta_0$ is the intercept; it is the expected value of $Y$ when all $X$s equal 
$0$. $\beta_j, \ j = 1, \dots, k$ are the coefficients on $X_j, \ j = 1, \dots, k$. $\beta_1$ 
measures the expected change in $Y_i$ that results from a one unit 
change in $X_{1i}$ while holding all other regressors constant. 





How can we estimate the coefficients of the multiple regression model (6.4)? 
We will not go too much into detail on this issue as our focus is on using R. 
However, it should be pointed out that, similarly to the simple regression model, 
the coefficients of the multiple regression model can be estimated using OLS. 
As in the simple model, we seek to minimize the sum of squared mistakes by 
choosing estimates $b_0, b_1, \dots, b_k$ for the coefficients $\beta_0, \beta_1, \dots, \beta_k$ such that 


$$ \sum_{i=1}^n (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i} - \dots - b_k X_{ki})^2 \tag{6.5} $$


is minimized. Note that (6.5) is simply an extension of SSR in the case with 
just one regressor and a constant. The estimators that minimize (6.5) are hence 
denoted $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ and, as in the simple regression model, we call them the 
ordinary least squares estimators of $\beta_0, \beta_1, \dots, \beta_k$. For the predicted value of $Y_i$ 
given the regressors and the estimates $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ we have 


$$ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \dots + \hat\beta_k X_{ki}. $$

The difference between $Y_i$ and its predicted value $\hat{Y}_i$ is called the OLS residual 
of observation $i$: $\hat{u}_i = Y_i - \hat{Y}_i$. 


For further information regarding the theory behind multiple regression, see 
Chapter 18.1 of the book which inter alia presents a derivation of the OLS 
estimator in the multiple regression model using matrix notation. 
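
In matrix notation the minimization problem leads to the normal equations $(X'X) b = X'Y$. The sketch below is an illustration on simulated data (the data generating process and the object names are assumptions): it solves the normal equations directly and compares the result with the estimates returned by lm().

```r
set.seed(9)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# regressor matrix with a column of ones for the intercept
X <- cbind(1, x1, x2)

# solve the normal equations (X'X) b = X'y
b_matrix <- unname(drop(solve(crossprod(X), crossprod(X, y))))

# compare with the estimates from lm()
b_lm <- unname(coef(lm(y ~ x1 + x2)))
rbind(matrix = b_matrix, lm = b_lm)
```

lm() does not solve the normal equations literally but uses a numerically more stable QR decomposition; for well-conditioned data such as these both approaches give the same estimates.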


Now let us jump back to the example of test scores and class sizes. The esti- 
mated model object is mult.mod. As for simple regression models we can use 
summary() to obtain information on estimated coefficients and model statistics. 


summary(mult.mod)$coef 
#>                Estimate Std. Error    t value      Pr(>|t|)
#> (Intercept) 686.0322445 7.41131160  92.565565 3.871327e-280
#> STR          -1.1012956 0.38027827  -2.896026  3.978059e-03
#> english      -0.6497768 0.03934254 -16.515882  1.657448e-47


So the estimated multiple regression model is 


$$ \widehat{TestScore} = 686.03 - 1.10 \times STR - 0.65 \times PctEL. \tag{6.6} $$
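
To see how the estimated equation is used, the following sketch evaluates it for a hypothetical school district with 20 students per teacher and 10% English learners; these values are assumptions chosen for illustration. It uses the rounded coefficients as printed in (6.6).

```r
# rounded coefficient estimates from (6.6)
b <- c(intercept = 686.03, STR = -1.10, PctEL = -0.65)

# hypothetical district: 20 students per teacher, 10% English learners
pred <- unname(b["intercept"] + b["STR"] * 20 + b["PctEL"] * 10)
pred

# predicted change when STR falls by 2 and PctEL is held constant
change <- unname(b["STR"] * (-2))
change
```

The predicted test score is 686.03 - 22 - 6.5 = 657.53, and lowering the student-teacher ratio by two while holding the share of English learners fixed raises the prediction by 2.2 points.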


Unlike in the simple regression model where the data can be represented by 
points in the two-dimensional coordinate system, we now have three dimensions. 
Hence observations can be represented by points in three-dimensional space. 
Therefore (6.6) is no longer a regression line but a regression plane. This idea 
extends to higher dimensions when we further expand the number of regressors 
k. We then say that the regression model can be represented by a hyperplane in 
the k + 1 dimensional space. It is already hard to imagine such a space if k = 3 
and we best stick with the general idea that, in the multiple regression model, 
the dependent variable is explained by a linear combination of the regressors. 
However, in the present case we are able to visualize the situation. The following 
figure is an interactive 3D visualization of the data and the estimated regression 
plane (6.6). 


This interactive part of the book is only available in the HTML version. 


We observe that the estimated regression plane fits the data reasonably well, at 
least with regard to the shape and spatial position of the points. The color of the 
markers is an indicator for the absolute deviation from the predicted regression 
plane. Observations that are colored more reddish lie close to the regression 
plane while the color shifts to blue with growing distance. An anomaly that 
can be seen from the plot is that there might be heteroskedasticity: we see that 
the dispersion of regression errors made, i.e., the distance of observations to 
the regression plane, tends to decrease as the share of English learning students 
increases. 


6.3 Measures of Fit in Multiple Regression 


In multiple regression, common summary statistics are $SER$, $R^2$ and the ad- 
justed $R^2$. 


Taking the code from Section 6.2, simply use summary(mult.mod) to obtain 
the $SER$, $R^2$ and adjusted $R^2$. For multiple regression models the $SER$ is 
computed as 


$$ SER = s_{\hat u} = \sqrt{s_{\hat u}^2} $$


where we modify the denominator of the premultiplied factor in $s_{\hat u}^2$ in order to 
accommodate for additional regressors. Thus, 


$$ s_{\hat u}^2 = \frac{1}{n-k-1} \, SSR $$


with k denoting the number of regressors excluding the intercept. 


While summary() computes the $R^2$ just as in the case of a single regressor, it is 
not a reliable measure of fit for multiple regression models. This is due to $R^2$ increasing 
whenever an additional regressor is added to the model. Adding a regressor de- 
creases the $SSR$, at least unless the respective estimated coefficient is exactly 
zero, which practically never happens (see Chapter 6.4 of the book for a detailed 
argument). The adjusted $R^2$ takes this into consideration by “punishing” the 
addition of regressors using a correction factor. So the adjusted $R^2$, or simply 
$\bar{R}^2$, is a modified version of $R^2$. It is defined as 


$$ \bar{R}^2 = 1 - \frac{n-1}{n-k-1} \, \frac{SSR}{TSS}. $$


As you may have already suspected, summary() adjusts the formula for $SER$ 
and it computes $\bar{R}^2$ and of course $R^2$ by default, thereby leaving the decision 
which measure to rely on to the user. 


You can find both measures at the bottom of the output produced by calling 
summary (mult .mod). 




summary(mult.mod)
#> 
#> Call:
#> lm(formula = score ~ STR + english, data = CASchools)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -48.845 -10.240  -0.308   9.815  43.461 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 686.03224    7.41131  92.566  < 2e-16 ***
#> STR          -1.10130    0.38028  -2.896  0.00398 ** 
#> english      -0.64978    0.03934 -16.516  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.46 on 417 degrees of freedom
#> Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
#> F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16


We can also compute the measures by hand using the formulas above. Let us
check that the results coincide with the values provided by summary().


# define the components
n <- nrow(CASchools)                            # number of observations (rows)
k <- 2                                          # number of regressors

y_mean <- mean(CASchools$score)                 # mean of avg. test-scores

SSR <- sum(residuals(mult.mod)^2)               # sum of squared residuals
TSS <- sum((CASchools$score - y_mean)^2)        # total sum of squares
ESS <- sum((fitted(mult.mod) - y_mean)^2)       # explained sum of squares

# compute the measures
SER <- sqrt(1/(n-k-1) * SSR)                    # standard error of the regression
Rsq <- 1 - (SSR / TSS)                          # R^2
adj_Rsq <- 1 - (n-1)/(n-k-1) * SSR/TSS          # adj. R^2

# print the measures to the console
c("SER" = SER, "R2" = Rsq, "Adj.R2" = adj_Rsq)
#>        SER         R2     Adj.R2 
#> 14.4644831  0.4264315  0.4236805


Now, what can we say about the fit of our multiple regression model for test
scores with the percentage of English learners as an additional regressor? Does
it improve on the simple model including only an intercept and a measure of
class size? The answer is yes: compare $\bar{R}^2$ with that obtained for the simple
regression model mod.


Including $PctEL$ as a regressor improves the $\bar{R}^2$, which we deem to be more
reliable in view of the above discussion. Notice that the difference between $R^2$
and $\bar{R}^2$ is small since $k = 2$ and $n$ is large. In short, the fit of (6.6) improves
vastly on the fit of the simple regression model with $STR$ as the only regressor.
Comparing regression errors we find that the precision of the multiple regression
model (6.6) improves upon the simple model as adding $PctEL$ lowers the $SER$
from 18.6 to 14.5 units of test score.


As already mentioned, $\bar{R}^2$ may be used to quantify how well a model fits the
data. However, it is rarely a good idea to maximize these measures by stuffing
the model with regressors. You will not find any serious study that does so.
Instead, it is more useful to include regressors that improve the estimation of
the causal effect of interest, which is not assessed by means of the $R^2$ of the model.
The issue of variable selection is covered in Chapter 8.
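

The mechanics behind this warning can be illustrated with a small simulation.
The following sketch uses artificial data (the names x, noise and y are made up
for illustration and are not part of CASchools) and adds a regressor that is
unrelated to the outcome by construction.

```r
# artificial data: y depends on x only, noise is irrelevant by construction
set.seed(123)
n <- 100
x <- rnorm(n)
noise <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# fit the model without and with the irrelevant regressor
small <- summary(lm(y ~ x))
large <- summary(lm(y ~ x + noise))

# R^2 cannot decrease when a regressor is added ...
c("small" = small$r.squared, "large" = large$r.squared)

# ... whereas the adjusted R^2 may go up or down
c("small" = small$adj.r.squared, "large" = large$adj.r.squared)
```

Comparing the two printed vectors shows why the adjusted measure is the more
cautious choice when models with different numbers of regressors are compared.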


6.4 OLS Assumptions in Multiple Regression 


In the multiple regression model we extend the three least squares assumptions
of the simple regression model (see Chapter 4) and add a fourth assumption.
These assumptions are presented in Key Concept 6.4. We will not go into the
details of assumptions 1-3 since their ideas generalize easily to the case of multiple
regressors. We will focus on the fourth assumption. This assumption rules out
perfect correlation between regressors.

Key Concept 6.4
The Least Squares Assumptions in the Multiple Regression
Model


The multiple regression model is given by


$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n. $$


The OLS assumptions in the multiple regression model are an extension
of the ones made for the simple regression model:


1. Regressors $(X_{1i}, X_{2i}, \dots, X_{ki}, Y_i)$, $i = 1, \dots, n$, are drawn such
that the i.i.d. assumption holds.


2. $u_i$ is an error term with conditional mean zero given the regressors,
i.e.,
$$ E(u_i \vert X_{1i}, X_{2i}, \dots, X_{ki}) = 0. $$


3. Large outliers are unlikely, formally $X_{1i}, \dots, X_{ki}$ and $Y_i$ have finite
fourth moments.


4. No perfect multicollinearity.





Multicollinearity 


Multicollinearity means that two or more regressors in a multiple regression 
model are strongly correlated. If the correlation between two or more regres- 
sors is perfect, that is, one regressor can be written as a linear combination of 
the other(s), we have perfect multicollinearity. While strong multicollinearity in 
general is unpleasant as it causes the variance of the OLS estimator to be large 
(we will discuss this in more detail later), the presence of perfect multicollinear- 
ity makes it impossible to solve for the OLS estimator, i.e., the model cannot 
be estimated in the first place. 


The next section presents some examples of perfect multicollinearity and
demonstrates how lm() deals with them.


Examples of Perfect Multicollinearity 


How does R react if we try to estimate a model with perfectly correlated
regressors?


lm() will produce a warning in the first line of the coefficient section of the
output (1 not defined because of singularities) and ignore the regressor(s)
which is (are) assumed to be a linear combination of the other(s). Consider
the following example where we add another variable FracEL, the fraction of
English learners, to CASchools. Its observations are scaled values of the
observations for english, and we use it as a regressor together with STR and english
in a multiple regression model. In this example english and FracEL are
perfectly collinear. The R code is as follows.


# define the fraction of English learners
CASchools$FracEL <- CASchools$english / 100

# estimate the model
mult.mod <- lm(score ~ STR + english + FracEL, data = CASchools)

# obtain a summary of the model
summary(mult.mod)
#> 
#> Call:
#> lm(formula = score ~ STR + english + FracEL, data = CASchools)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -48.845 -10.240  -0.308   9.815  43.461 
#> 
#> Coefficients: (1 not defined because of singularities)
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 686.03224    7.41131  92.566  < 2e-16 ***
#> STR          -1.10130    0.38028  -2.896  0.00398 ** 
#> english      -0.64978    0.03934 -16.516  < 2e-16 ***
#> FracEL             NA         NA      NA       NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.46 on 417 degrees of freedom
#> Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
#> F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16


The row FracEL in the coefficients section of the output consists of NA entries 
since FracEL was excluded from the model. 


If we were to compute OLS by hand, we would run into the same problem but 
no one would be helping us out! The computation simply fails. Why is this? 
Take the following example: 


Assume you want to estimate a simple linear regression model with a constant
and a single regressor $X$. As mentioned above, for perfect multicollinearity to
be present $X$ has to be a linear combination of the other regressors. Since the
only other regressor is a constant (think of the right hand side of the model
equation as $\beta_0 \times 1 + \beta_1 X_i + u_i$ so that $\beta_0$ is always multiplied by 1 for every
observation), $X$ has to be constant as well. For $\hat{\beta}_1$ we have


$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\widehat{Cov}(X, Y)}{\widehat{Var}(X)}. \tag{6.7} $$


The variance of the regressor $X$ is in the denominator. Since the variance of a
constant is zero, we are not able to compute this fraction and $\hat{\beta}_1$ is undefined.


Note: In this special case the numerator in (6.7) equals zero, too. Can you
show that?
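

A quick numerical check of this argument can be sketched with made-up data (the
vectors x and y below are purely illustrative and not taken from CASchools):
both parts of the fraction in (6.7) are zero for a constant regressor, and lm()
responds by reporting NA for the slope.

```r
# made-up data: the regressor x takes the same value for every observation
set.seed(1)
y <- rnorm(20)
x <- rep(3, 20)

# denominator of (6.7): sample variance of a constant
var(x)

# numerator of (6.7): every deviation from the mean of x is zero
sum((x - mean(x)) * (y - mean(y)))

# lm() reports NA for the slope since x is collinear with the intercept
fit <- lm(y ~ x)
coef(fit)
```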


Let us consider two further examples where our selection of regressors induces
perfect multicollinearity. First, assume that we intend to analyze the effect of
class size on test score by using a dummy variable that identifies classes which
are not small ($NS$). We define that a school has the $NS$ attribute when the
school’s average student-teacher ratio is at least 12,


$$ NS = \begin{cases} 0, & \text{if } STR < 12 \\ 1, & \text{otherwise.} \end{cases} $$


We add the corresponding column to CASchools and estimate a multiple
regression model with covariates computer and english.


# if STR smaller 12, NS = 0, else NS = 1
CASchools$NS <- ifelse(CASchools$STR < 12, 0, 1)

# estimate the model
mult.mod <- lm(score ~ computer + english + NS, data = CASchools)

# obtain a model summary
summary(mult.mod)
#> 
#> Call:
#> lm(formula = score ~ computer + english + NS, data = CASchools)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#>     ...     ...     ...     ...     ... 
#> 
#> Coefficients: (1 not defined because of singularities)
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 663.704837   0.984259 674.319  < 2e-16 ***
#> computer      0.005374   0.001670   3.218  0.00139 ** 
#> english      -0.708947   0.040303 -17.591  < 2e-16 ***
#> NS                  NA         NA      NA       NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.43 on 417 degrees of freedom
#> Multiple R-squared:  0.4291, Adjusted R-squared:  0.4263 
#> F-statistic: 156.7 on 2 and 417 DF,  p-value: < 2.2e-16


Again, the output of summary(mult.mod) tells us that inclusion of NS in the
regression would render the estimation infeasible. What happened here? This
is an example where we made a logical mistake when defining the regressor
NS: taking a closer look at NS, the redefined measure for class size, reveals
that there is not a single school with $STR < 12$, hence $NS$ equals one for all
observations. We can check this by printing the contents of CASchools$NS or
by using the function table(), see ?table.


table(CASchools$NS)
#> 
#>   1 
#> 420


CASchools$NS is a vector of 420 ones and our data set includes 420 observations.
This obviously violates assumption 4 of Key Concept 6.4: the observations for
the intercept are always 1,


$$ \begin{aligned} intercept &= \lambda \cdot NS \\ \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} &= \lambda \cdot \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \\ \Leftrightarrow \, \lambda &= 1. \end{aligned} $$


Since the regressors can be written as a linear combination of each other, we face
perfect multicollinearity and R excludes NS from the model. Thus the take-away
message is: think carefully about how the regressors in your models relate!


Another example of perfect multicollinearity is known as the dummy variable
trap. This may occur when multiple dummy variables are used as regressors. A
common case for this is when dummies are used to sort the data into mutually
exclusive categories. For example, suppose we have spatial information that
indicates whether a school is located in the North, West, South or East of the
U.S. This allows us to create the dummy variables


$$ North_i = \begin{cases} 1, & \text{if located in the north} \\ 0, & \text{otherwise} \end{cases} $$

$$ West_i = \begin{cases} 1, & \text{if located in the west} \\ 0, & \text{otherwise} \end{cases} $$

$$ South_i = \begin{cases} 1, & \text{if located in the south} \\ 0, & \text{otherwise} \end{cases} $$

$$ East_i = \begin{cases} 1, & \text{if located in the east} \\ 0, & \text{otherwise.} \end{cases} $$


Since the regions are mutually exclusive, for every school $i = 1, \dots, n$ we have


$$ North_i + West_i + South_i + East_i = 1. $$


We run into problems when trying to estimate a model that includes a constant
and all four direction dummies, e.g.,


$$ TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times english + \beta_3 \times North_i + \beta_4 \times West_i + \beta_5 \times South_i + \beta_6 \times East_i + u_i \tag{6.8} $$


since then for all observations $i = 1, \dots, n$ the constant term is a linear
combination of the dummies:


$$ \begin{aligned} intercept &= \lambda \cdot (North + West + South + East) \\ \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} &= \lambda \cdot \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \\ \Leftrightarrow \, \lambda &= 1 \end{aligned} $$


and we have perfect multicollinearity. Thus the “dummy variable trap” means
not paying attention and falsely including exhaustive dummies and a constant
in a regression model.
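

The linear dependence can be made visible by inspecting the regressor matrix
directly. The following is a minimal sketch with hypothetical region labels
(the objects region, D and X are invented for this illustration): the four
dummies sum to one in every row, so adding a column of ones does not increase
the rank of the matrix.

```r
# hypothetical region labels for 40 schools (10 per region)
region <- factor(rep(c("North", "West", "South", "East"), times = 10))

# one dummy column per region, without an intercept column
D <- model.matrix(~ region - 1)

# every row contains exactly one 1, so the dummies sum to the constant
all(rowSums(D) == 1)

# adding a column of ones yields 5 columns but only rank 4
X <- cbind(intercept = 1, D)
ncol(X)
qr(X)$rank
```

A rank below the number of columns is exactly the situation in which the OLS
estimator cannot be computed for all coefficients.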


How does lm() handle a regression like (6.8)? Let us first generate some artificial
categorical data, append a new column named direction to CASchools
and see how lm() behaves when asked to estimate the model.

# set seed for reproducibility
set.seed(1)

# generate artificial data on location
CASchools$direction <- sample(c("West", "North", "South", "East"), 
                              420, 
                              replace = TRUE)

# estimate the model
mult.mod <- lm(score ~ STR + english + direction, data = CASchools)

# obtain a model summary
summary(mult.mod)
#> 
#> Call:
#> lm(formula = score ~ STR + english + direction, data = CASchools)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -49.603 -10.175  -0.484   9.524  42.830 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    684.80477    7.54130  90.807  < 2e-16 ***
#> STR             -1.08873    0.38153  -2.854  0.00454 ** 
#> english         -0.65597    0.04018 -16.325  < 2e-16 ***
#> directionNorth   1.66314    2.05870   0.808  0.41964    
#> directionSouth   0.71619    2.06321   0.347  0.72867    
#> directionWest        ...        ...   0.905  0.36598    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.5 on 414 degrees of freedom
#> Multiple R-squared:  0.4279, Adjusted R-squared:  0.421 
#> F-statistic: 61.92 on 5 and 414 DF,  p-value: < 2.2e-16


Notice that R solves the problem on its own by generating and including the
dummies directionNorth, directionSouth and directionWest but omitting
directionEast. Of course, omitting any other dummy instead would
achieve the same. Another solution would be to exclude the constant and to
include all dummies instead.


Does this mean that the information on schools located in the East is lost?
Fortunately, this is not the case: exclusion of directionEast just alters the
interpretation of coefficient estimates on the remaining dummies from absolute to
relative. For example, the coefficient estimate on directionNorth states that,
on average, test scores in the North are about 1.66 points higher than in the
East.
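

The switch from absolute to relative interpretation is easiest to see in a model
that contains nothing but the dummies. The sketch below uses made-up outcomes
(region and y are invented for this illustration): with a constant, each
coefficient measures the difference to the omitted base category; without a
constant, each coefficient is simply the group mean.

```r
# made-up outcomes for 5 schools in each of four regions
set.seed(42)
region <- factor(rep(c("East", "North", "South", "West"), each = 5))
y <- rnorm(20, mean = rep(c(650, 652, 651, 653), each = 5), sd = 2)

# with an intercept, "East" is the omitted base category
with_const <- lm(y ~ region)
coef(with_const)

# without an intercept, all four dummies enter and estimate group means
no_const <- lm(y ~ 0 + region)
coef(no_const)

# the coefficient on regionNorth equals the difference in group means
group_means <- tapply(y, region, mean)
group_means["North"] - group_means["East"]
```

Once further regressors such as STR and english are included, the dummy
coefficients are differences conditional on those regressors rather than raw
differences in means.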


A last example considers the case where a perfect linear relationship arises from
redundant regressors. Suppose we have a regressor $PctES$, the percentage of
English speakers in the school where


$$ PctES = 100 - PctEL $$


and both $PctES$ and $PctEL$ are included in a regression model. One regressor
is redundant since the other one conveys the same information. Since this
obviously is a case where the regressors can be written as a linear combination,
we end up with perfect multicollinearity, again.


Let us do this in R.


# Percentage of english speakers
CASchools$PctES <- 100 - CASchools$english

# estimate the model
mult.mod <- lm(score ~ STR + english + PctES, data = CASchools)

# obtain a model summary
summary(mult.mod)
#> 
#> Call:
#> lm(formula = score ~ STR + english + PctES, data = CASchools)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -48.845 -10.240  -0.308   9.815  43.461 
#> 
#> Coefficients: (1 not defined because of singularities)
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 686.03224    7.41131  92.566  < 2e-16 ***
#> STR          -1.10130    0.38028  -2.896  0.00398 ** 
#> english      -0.64978    0.03934 -16.516  < 2e-16 ***
#> PctES              NA         NA      NA       NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.46 on 417 degrees of freedom
#> Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
#> F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16


Once more, lm() refuses to estimate the full model using OLS and excludes
PctES.

See Chapter 18.1 of the book for an explanation of perfect multicollinearity and
its consequences for the OLS estimator in general multiple regression models
using matrix notation.


Imperfect Multicollinearity 


As opposed to perfect multicollinearity, imperfect multicollinearity is — to a 
certain extent — less of a problem. In fact, imperfect multicollinearity is the 
reason why we are interested in estimating multiple regression models in the first 
place: the OLS estimator allows us to isolate influences of correlated regressors 
on the dependent variable. If it was not for these dependencies, there would not 
be a reason to resort to a multiple regression approach and we could simply work 
with a single-regressor model. However, this is rarely the case in applications. 
We already know that ignoring dependencies among regressors which influence 
the outcome variable has an adverse effect on estimation results. 


So when and why is imperfect multicollinearity a problem? Suppose you have
the regression model


$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \tag{6.9} $$


and you are interested in estimating $\beta_1$, the effect on $Y_i$ of a one unit change in
$X_{1i}$, while holding $X_{2i}$ constant. You do not know that the true model indeed
includes $X_2$. You follow some reasoning and add $X_2$ as a covariate to the
model in order to address a potential omitted variable bias. You are confident
that $E(u_i \vert X_{1i}, X_{2i}) = 0$ and that there is no reason to suspect a violation of
the assumptions 2 and 3 made in Key Concept 6.4. If $X_1$ and $X_2$ are highly
correlated, OLS struggles to precisely estimate $\beta_1$. That means that although
$\hat{\beta}_1$ is a consistent and unbiased estimator for $\beta_1$, it has a large variance due to
$X_2$ being included in the model. If the errors are homoskedastic, this issue can
be better understood from the formula for the variance of $\hat{\beta}_1$ in the model (6.9)
(see Appendix 6.2 of the book):


$$ \sigma^2_{\hat{\beta}_1} = \frac{1}{n} \left( \frac{1}{1 - \rho^2_{X_1, X_2}} \right) \frac{\sigma^2_u}{\sigma^2_{X_1}}. \tag{6.10} $$


First, if $\rho_{X_1, X_2} = 0$, i.e., if there is no correlation between both regressors,
including $X_2$ in the model has no influence on the variance of $\hat{\beta}_1$. Secondly,
if $X_1$ and $X_2$ are correlated, $\sigma^2_{\hat{\beta}_1}$ is inversely proportional to $1 - \rho^2_{X_1, X_2}$ so the
stronger the correlation between $X_1$ and $X_2$, the smaller is $1 - \rho^2_{X_1, X_2}$ and
thus the bigger is the variance of $\hat{\beta}_1$. Thirdly, increasing the sample size helps
to reduce the variance of $\hat{\beta}_1$. Of course, this is not limited to the case with
two regressors: in multiple regressions, imperfect multicollinearity inflates the
variance of one or more coefficient estimators. It is an empirical question which
coefficient estimates are severely affected by this and which are not. When
the sample size is small, one often faces the decision whether to accept the
consequence of adding a large number of covariates (higher variance) or to use
a model with only few regressors (possible omitted variable bias). This is called
the bias-variance trade-off.


In sum, undesirable consequences of imperfect multicollinearity are generally 
not the result of a logical error made by the researcher (as is often the case for 
perfect multicollinearity) but are rather a problem that is linked to the data 
used, the model to be estimated and the research question at hand. 


Simulation Study: Imperfect Multicollinearity 


Let us conduct a simulation study to illustrate the issues sketched above.


1. We use (6.9) as the data generating process and choose $\beta_0 = 5$, $\beta_1 = 2.5$
and $\beta_2 = 3$. The error term $u_i$ is normally distributed with mean 0 and
standard deviation 5. In a first step,
we sample the regressor data from a bivariate normal distribution:


$$ X_i = (X_{1i}, X_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right] $$


(The code below uses regressor means of 50 and 100 instead. Since the model
includes an intercept, the means do not affect the slope estimators.)
It is straightforward to see that the correlation between $X_1$ and $X_2$ in the
population is rather low:


$$ \rho_{X_1, X_2} = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)} \sqrt{Var(X_2)}} = \frac{2.5}{10} = 0.25 $$


2. Next, we estimate the model (6.9) and save the estimates for $\beta_1$ and $\beta_2$.
This is repeated 10000 times with a for loop so we end up with a large
number of estimates that allow us to describe the distributions of $\hat{\beta}_1$ and
$\hat{\beta}_2$.


3. We repeat steps 1 and 2 but increase the covariance between $X_1$ and $X_2$
from 2.5 to 8.5 such that the correlation between the regressors is high:


$$ \rho_{X_1, X_2} = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)} \sqrt{Var(X_2)}} = \frac{8.5}{10} = 0.85 $$


4. In order to assess the effect on the precision of the estimators of increasing
the collinearity between $X_1$ and $X_2$ we estimate the variances of $\hat{\beta}_1$ and
$\hat{\beta}_2$ and compare.
# load packages
library(MASS)
library(mvtnorm)

# set number of observations
n <- 50

# initialize vectors of coefficients
coefs1 <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
coefs2 <- coefs1

# set seed
set.seed(1)

# loop sampling and estimation
for (i in 1:10000) {
  
  # for cor(X_1, X_2) = 0.25
  X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 2.5), c(2.5, 10)))
  u <- rnorm(n, sd = 5)
  Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
  coefs1[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
  
  # for cor(X_1, X_2) = 0.85
  X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 8.5), c(8.5, 10)))
  Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
  coefs2[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
  
}

# obtain variance estimates
diag(var(coefs1))
#> hat_beta_1 hat_beta_2 
#> 0.05674375 0.05712459
diag(var(coefs2))
#> hat_beta_1 hat_beta_2 
#>  0.1904949  0.1909056


We are interested in the variances which are the diagonal elements. We see that
due to the high collinearity, the variances of $\hat{\beta}_1$ and $\hat{\beta}_2$ have more than tripled,
meaning it is more difficult to precisely estimate the true coefficients.
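

The size of this increase can be compared with what formula (6.10) predicts.
The following sketch assumes that only the correlation differs between the two
designs (sample size, error variance and the variance of $X_1$ are the same), so
that the ratio of the variances reduces to the ratio of the factors
$1/(1 - \rho^2_{X_1, X_2})$. The object names are chosen for this illustration.

```r
# variance inflation implied by (6.10) for both correlations
rho <- c("low" = 0.25, "high" = 0.85)
inflation <- 1 / (1 - rho^2)
inflation

# ratio of variances predicted by (6.10)
predicted_ratio <- unname(inflation["high"] / inflation["low"])
predicted_ratio

# ratio of the simulated variances of hat_beta_1 reported above
simulated_ratio <- 0.1904949 / 0.05674375
simulated_ratio
```

Both ratios are roughly 3.4, so the simulation outcome is in line with the
large-sample formula.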
6.5 The Distribution of the OLS Estimators in
Multiple Regression


As in simple linear regression, different samples will produce different values
of the OLS estimators in the multiple regression model. Again, this variation
leads to uncertainty of those estimators which we seek to describe using their
sampling distribution(s). In short, if the assumptions made in Key Concept 6.4
hold, the large sample distribution of $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ is multivariate normal such
that the individual estimators themselves are also normally distributed. Key
Concept 6.5 summarizes the corresponding statements made in Chapter 6.6 of
the book. A more technical derivation of these results can be found in Chapter
18 of the book.


Key Concept 6.5
Large-sample distribution of $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$


If the least squares assumptions in the multiple regression model (see
Key Concept 6.4) hold, then, in large samples, the OLS estimators
$\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ are jointly normally distributed. We also say that their
joint distribution is multivariate normal. Further, each $\hat{\beta}_j$ is distributed
as $\mathcal{N}(\beta_j, \sigma^2_{\hat{\beta}_j})$.





Essentially, Key Concept 6.5 states that, if the sample size is large, we can ap- 
proximate the individual sampling distributions of the coefficient estimators by 
specific normal distributions and their joint sampling distribution by a multi- 
variate normal distribution. 


How can we use R to get an idea of what the joint PDF of the coefficient
estimators in a multiple regression model looks like? When estimating a model
on some data, we end up with a single set of point estimates that does not reveal
much information on the joint density of the estimators. However, with a large
number of estimations using repeatedly randomly sampled data from the same
population we can generate a large set of point estimates that allows us to plot
an estimate of the joint density function.


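The approach we will use to do this in R is as follows:

• Generate 10000 random samples of size 50 using the DGP

$$ Y_i = 5 + 2.5 \cdot X_{1i} + 3 \cdot X_{2i} + u_i $$

where the regressors $X_{1i}$ and $X_{2i}$ are sampled for each observation as

$$ X_i = (X_{1i}, X_{2i}) \sim \mathcal{N} \left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right] $$

and $u_i$ is a normally distributed error term with mean 0 and standard
deviation 5. (The code below uses regressor means of 50 and 100, which does
not affect the slope estimators.)

• For each of the 10000 simulated sets of sample data, we estimate the model

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i $$

and save the coefficient estimates $\hat{\beta}_1$ and $\hat{\beta}_2$.

• We compute a density estimate of the joint distribution of $\hat{\beta}_1$ and $\hat{\beta}_2$ in
the model above using the function kde2d() from the package MASS, see
?MASS. This estimate is then plotted using the function persp().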


# load packages
library(MASS)
library(mvtnorm)

# set sample size
n <- 50

# initialize vector of coefficients
coefs <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))

# set seed for reproducibility
set.seed(1)

# loop sampling and estimation
for (i in 1:10000) {
  
  X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 2.5), c(2.5, 10)))
  u <- rnorm(n, sd = 5)
  Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
  coefs[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
  
}

# compute density estimate
kde <- kde2d(coefs[, 1], coefs[, 2])

# plot density estimate
persp(kde, 
      theta = 310, 
      phi = 30, 
      xlab = "beta_1", 
      ylab = "beta_2", 
      zlab = "Est. Density")







From the plot above we can see that the density estimate has some similarity
to a bivariate normal distribution (see Chapter 2) though it is not very pretty
and probably a little rough. Furthermore, there is a correlation between the
estimates such that $\rho \neq 0$ in (2.1). Also, the distribution’s shape deviates from
the symmetric bell shape of the bivariate standard normal distribution and has
an elliptical surface area instead.


# estimate the correlation between estimators 
cor(coefs[, 1], coefs[, 2]) 
#> [1] -0.2503028 


Where does this correlation come from? Notice that, due to the way we
generated the data, there is correlation between the regressors $X_1$ and $X_2$.
Correlation between the regressors in a multiple regression model always translates into
correlation between the estimators (see Appendix 6.2 of the book). In our case,
the positive correlation between $X_1$ and $X_2$ translates to negative correlation
between $\hat{\beta}_1$ and $\hat{\beta}_2$. To get a better idea of the distribution you can vary the
point of view in the subsequent smooth interactive 3D plot of the same density
estimate used for plotting with persp(). Here you can see that the shape of
the distribution is somewhat stretched due to $\rho \approx -0.25$ and it is also apparent
that both estimators are unbiased since their joint density seems to be centered
close to the true parameter vector $(\beta_1, \beta_2) = (2.5, 3)$.
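

The sign and size of this correlation can be anticipated from the design of the
simulation. The sketch below relies on a standard large-sample result that is
taken as given here rather than derived: under homoskedastic errors, the
covariance matrix of the slope estimators is proportional to the inverse of the
covariance matrix of the regressors. The object names are chosen for this
illustration.

```r
# covariance matrix of the regressors used in the simulation
Sigma <- cbind(c(10, 2.5), c(2.5, 10))

# up to a positive factor, the covariance matrix of the slope estimators
V <- solve(Sigma)

# implied correlation between hat_beta_1 and hat_beta_2
cov2cor(V)[1, 2]
```

The implied value of -0.25 is very close to the correlation of about -0.2503
estimated from the simulated coefficients.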


This interactive part of the book is only available in the HTML version. 


6.6 Exercises 


This interactive part of the book is only available in the HTML version. 


Chapter 7 


Hypothesis Tests and 
Confidence Intervals in 
Multiple Regression 


This chapter discusses methods that allow us to quantify the sampling uncertainty
in the OLS estimator of the coefficients in multiple regression models. The basis
for this are hypothesis tests and confidence intervals which, just as for the simple
linear regression model, can be computed using basic R functions. We will also
tackle the issue of testing joint hypotheses on these coefficients.


Make sure the packages AER (Kleiber and Zeileis, 2020) and stargazer (Hlavac, 
2018) are installed before you go ahead and replicate the examples. The safest 
way to do so is by checking whether the following code chunk executes without 
any issues. 


library(AER)
library(stargazer)


7.1 Hypothesis Tests and Confidence Intervals 
for a Single Coefficient 


We first discuss how to compute standard errors, how to test hypotheses and
how to construct confidence intervals for a single regression coefficient $\beta_j$ in a
multiple regression model. The basic idea is summarized in Key Concept 7.1.

Key Concept 7.1
Testing the Hypothesis $\beta_j = \beta_{j,0}$ Against the Alternative $\beta_j \neq \beta_{j,0}$


1. Compute the standard error of $\hat{\beta}_j$.


2. Compute the $t$-statistic,

$$ t^{act} = \frac{\hat{\beta}_j - \beta_{j,0}}{SE(\hat{\beta}_j)}. $$


3. Compute the $p$-value,

$$ p\text{-value} = 2 \Phi(-|t^{act}|) $$

where $t^{act}$ is the value of the $t$-statistic actually computed. Reject
the hypothesis at the 5% significance level if the $p$-value is less than
0.05 or, equivalently, if $|t^{act}| > 1.96$.


The standard error and (typically) the $t$-statistic and the corresponding
$p$-value for testing $\beta_j = 0$ are computed automatically by suitable R
functions, e.g., by summary().





Testing a single hypothesis about the significance of a coefficient in the multiple
regression model proceeds as in the simple regression model.


You can easily see this by inspecting the coefficient summary of the regression
model


$$ TestScore = \beta_0 + \beta_1 \times size + \beta_2 \times english + u $$

already discussed in Chapter 6. Let us review this:


model <- lm(score ~ size + english, data = CASchools)
coeftest(model, vcov. = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>               Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept) 686.032245   8.728225  78.5993  < 2e-16 ***
#> size         -1.101296   0.432847  -2.5443  0.01131 *  
#> english      -0.649777   0.031032 -20.9391  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


You may check that these quantities are computed as in the simple regression
model by computing the t-statistics or p-values by hand using the output above
and R as a calculator.
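

As a minimal sketch of such a calculation, the following lines apply the steps
of Key Concept 7.1 to the robust estimate and standard error for size, typed in
by hand from the output above (the object names are chosen for this
illustration).

```r
# robust estimate and standard error for size, copied from the output above
estimate <- -1.101296
std_error <- 0.432847

# t-statistic for the null hypothesis that the coefficient is zero
t_act <- (estimate - 0) / std_error
t_act

# two-sided p-value based on the standard normal distribution
p_value <- 2 * pnorm(-abs(t_act))
p_value

# decision at the 5% level
abs(t_act) > qnorm(0.975)
```

The p-value obtained this way is slightly smaller than the one reported by
coeftest(), which is based on the t distribution with the residual degrees of
freedom instead of the standard normal distribution.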


For example, using the definition of the p-value for a two-sided test as given in
Key Concept 7.1, we can confirm that the p-value for a test of the hypothesis that
$\beta_1$, the coefficient on size, is zero equals approximately 0.011.


# compute two-sided p-value
2 * (1 - pt(abs(coeftest(model, vcov. = vcovHC, type = "HC1")[2, 3]),
            df = model$df.residual))
#> [1] 0.01130921


Key Concept 7.2
Confidence Intervals for a Single Coefficient in Multiple
Regression


A 95% two-sided confidence interval for the coefficient $\beta_j$ is an interval
that contains the true value of $\beta_j$ with a 95% probability; that is, it
contains the true value of $\beta_j$ in 95% of repeated samples. Equivalently,
it is the set of values of $\beta_j$ that cannot be rejected by a 5% two-sided
hypothesis test. When the sample size is large, the 95% confidence interval
for $\beta_j$ is


$$ \left[ \hat{\beta}_j - 1.96 \times SE(\hat{\beta}_j), \; \hat{\beta}_j + 1.96 \times SE(\hat{\beta}_j) \right]. $$
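

The equivalence between the confidence interval and the set of hypothesized
values that a 5% two-sided test does not reject can be checked numerically. The
sketch below uses the robust estimate and standard error for size reported in
Section 7.1, typed in by hand, and a grid of hypothesized values b0 that is
chosen arbitrarily for this illustration.

```r
# robust estimate and standard error for size (typed in from Section 7.1)
estimate <- -1.101296
std_error <- 0.432847

# large-sample 95% confidence interval as in Key Concept 7.2
ci <- estimate + c(-1, 1) * 1.96 * std_error
ci

# test H0: beta = b0 on a grid of hypothesized values
b0 <- seq(-2.5, 0.5, by = 0.01)
not_rejected <- abs((estimate - b0) / std_error) <= 1.96

# the values that are not rejected are those inside the interval
range(b0[not_rejected])
```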





7.2 An Application to Test Scores and the 
Student-Teacher Ratio 


Let us take a look at the regression from Section 6.3 again. 


Computing confidence intervals for individual coefficients in the multiple re- 
gression model proceeds as in the simple regression model using the function 
confint(). 


model <- lm(score ~ size + english, data = CASchools)
confint(model)
#>                   2.5 %      97.5 %
#> (Intercept) 671.4640580 700.6004311
#> size         -1.8487969  -0.3537944
#> english      -0.7271119  -0.5724424


To obtain confidence intervals at another level, say 90%, just set the argument
level in our call of confint() accordingly.


confint(model, level = 0.9)
#>                     5 %        95 %
#> (Intercept) 673.8145793 698.2499098
#> size         -1.7281904  -0.4744009
#> english      -0.7146336  -0.5849200


The output now reports the desired 90% confidence intervals for all coefficients.


A disadvantage of confint() is that it does not use robust standard errors to
compute the confidence interval. For large-sample confidence intervals, this is
quickly done manually as follows.


# compute robust standard errors
rob_se <- diag(vcovHC(model, type = "HC1"))^0.5

# compute robust 95% confidence intervals
rbind("lower" = coef(model) - qnorm(0.975) * rob_se,
      "upper" = coef(model) + qnorm(0.975) * rob_se)
#>       (Intercept)       size    english
#> lower    668.9252 -1.9496606 -0.7105980
#> upper    703.1393 -0.2529307 -0.5889557

# compute robust 90% confidence intervals
rbind("lower" = coef(model) - qnorm(0.95) * rob_se,
      "upper" = coef(model) + qnorm(0.95) * rob_se)
#>       (Intercept)       size    english
#> lower    671.6756 -1.8132659 -0.7008195
#> upper    700.3889 -0.3893254 -0.5987341


Knowing how to use R to make inference about the coefficients in multiple 
regression models, you can now answer the following question: 


Can the null hypothesis that a change in the student-teacher ratio, size, has no 
significant influence on test scores, scores, — if we control for the percentage 
of students learning English in the district, english, — be rejected at the 10% 
and the 5% level of significance? 


The output above shows that zero is not an element of the confidence interval for 
the coefficient on size such that we can reject the null hypothesis at significance 
levels of 5% and 10%. The same conclusion can be made via the p-value for 
size: $0.00398 < 0.05 = \alpha$. 


Note that rejection at the 5% level implies rejection at the 10% level (why?). 

Recall from Chapter 5.2 that the 95% confidence interval computed above does not 
tell us that, with a probability of 95%, a one-unit decrease in the 
student-teacher ratio has an effect on test scores that lies in the interval 
with a lower bound of -1.9497 and an upper bound of -0.2530. Once a confidence 
interval has been computed, a probabilistic statement like this is wrong: either 
the interval contains the true parameter or it does not. We do not know which is 
true. 
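

The repeated-sampling interpretation can be made concrete with a small 
simulation. The sketch below is an assumption-laden illustration (simulated 
data, an assumed true slope of 2, homoskedastic errors) and is not part of the 
original analysis: it draws many samples, computes a 95% confidence interval in 
each and records how often the interval covers the true slope. 

```r
set.seed(1)
beta_true <- 2    # assumed true slope in the simulation
n <- 100
reps <- 1000
covered <- logical(reps)

for (r in seq_len(reps)) {
  x <- rnorm(n)
  y <- 1 + beta_true * x + rnorm(n)
  ci <- confint(lm(y ~ x))["x", ]
  covered[r] <- ci[1] <= beta_true && beta_true <= ci[2]
}

# share of intervals that contain the true slope
mean(covered)
```

The share should be close to 0.95. Each individual interval, however, either 
contains the true slope or it does not. 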


Another Augmentation of the Model 


What is the average effect on test scores of reducing the student-teacher ratio 
when the expenditures per pupil and the percentage of English learning pupils 
are held constant? 


Let us augment our model by an additional regressor that is a measure for 
expenditure per pupil. Using ?CASchools we find that CASchools contains the 
variable expenditure, which provides expenditure per student. 


Our model now is 


$$TestScore = \beta_0 + \beta_1 \times size + \beta_2 \times english + \beta_3 \times expenditure + u$$


with expenditure the total amount of expenditure per pupil in the district 
(thousands of dollars). 


Let us now estimate the model: 


# scale expenditure to thousands of dollars 
CASchools$expenditure <- CASchools$expenditure/1000 


# estimate the model 
model <- lm(score ~ size + english + expenditure, data = CASchools) 
coeftest(model, vcov. = vcovHC, type = "HC1") 
#> 
#> t test of coefficients: 
#> 
#>               Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept) 649.577947  15.458344  42.0212  < 2e-16 ***
#> size         -0.286399   0.482073  -0.5941  0.55277    
#> english      -0.656023   0.031784 -20.6398  < 2e-16 ***
#> expenditure   3.867901   1.580722   2.4469  0.01482 *  
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated effect of a one unit change in the student-teacher ratio on test 
scores with expenditure and the share of English learning pupils held constant is 
-0.29, which is rather small. What is more, the coefficient on size is not 
significantly different from zero anymore even at 10% since p-value = 0.55. Can 
you come up with an interpretation for these findings (see Chapter 7.1 of the 
book)? The insignificance of $\hat\beta_1$ could be due to a larger standard 
error of $\hat\beta_1$ resulting from adding expenditure to the model so that we 
estimate the coefficient on size less precisely. This illustrates the issue of 
strongly correlated regressors (imperfect multicollinearity). The correlation 
between size and expenditure can be computed using cor(). 


# compute the sample correlation between 'size' and 'expenditure' 
cor(CASchools$size, CASchools$expenditure) 
#> [1] -0.6199822 


Altogether, we conclude that the new model provides no evidence that changing 
the student-teacher ratio, e.g., by hiring new teachers, has any effect on the test 
scores while keeping expenditures per student and the share of English learners 
constant. 
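

The loss of precision caused by correlated regressors can be reproduced in a 
small simulation. The following sketch is only an illustration under assumed 
values: the variables x1, x2 and y are simulated, x2 is constructed to be 
strongly negatively correlated with x1, and x2 does not enter the 
data-generating process, so any change in the standard error of the 
coefficient on x1 stems from the correlation alone. 

```r
set.seed(42)
n  <- 420
x1 <- rnorm(n)
# x2 is strongly negatively correlated with x1 (illustrative choice)
x2 <- -0.8 * x1 + rnorm(n, sd = 0.6)
y  <- 1 - 1 * x1 + rnorm(n, sd = 2)

short <- lm(y ~ x1)
long  <- lm(y ~ x1 + x2)

# sample correlation between the two regressors
cor(x1, x2)

# standard error of the coefficient on x1 without and with x2
se_short <- summary(short)$coefficients["x1", "Std. Error"]
se_long  <- summary(long)$coefficients["x1", "Std. Error"]
c(short = se_short, long = se_long)
```

The standard error in the longer regression is larger, which mirrors what 
happens to the coefficient on size once expenditure is added. 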


7.3 Joint Hypothesis Testing Using the F-Statistic 


The estimated model is 


$$\widehat{TestScore} = \underset{(15.21)}{649.58} - \underset{(0.48)}{0.29} \times size - \underset{(0.04)}{0.66} \times english + \underset{(1.41)}{3.87} \times expenditure.$$


Now, can we reject the hypothesis that the coefficient on size and the coefficient 
on expenditure are zero? To answer this, we have to resort to joint hypothesis 
tests. A joint hypothesis imposes restrictions on multiple regression coefficients. 
This is different from conducting individual t-tests where a restriction is imposed 
on a single coefficient. Chapter 7.2 of the book explains why testing hypotheses 
about the model coefficients one at a time is different from testing them jointly. 


The homoskedasticity-only F-statistic is given by 


$$F = \frac{(SSR_{\text{restricted}} - SSR_{\text{unrestricted}})/q}{SSR_{\text{unrestricted}}/(n-k-1)}$$


with $SSR_{\text{restricted}}$ being the sum of squared residuals from the 
restricted regression, i.e., the regression where we impose the restriction. 
$SSR_{\text{unrestricted}}$ is the sum of squared residuals from the full model, 
$q$ is the number of restrictions under the null and $k$ is the number of 
regressors in the unrestricted regression. 

It is fairly easy to conduct F-tests in R. We can use the function 
linearHypothesis() contained in the package car. 


# estimate the multiple regression model 
model <- lm(score ~ size + english + expenditure, data = CASchools) 


# execute the function on the model object and provide both linear restrictions 
# to be tested as strings 

linearHypothesis(model, c("size=0", "expenditure=0") ) 

#> Linear hypothesis test 

#> 

#> Hypothesis: 

#> size = 0 

#> expenditure = 0 

#> 

#> Model 1: restricted model 

#> Model 2: score ~ size + english + expenditure 

#> 

#>   Res.Df   RSS Df Sum of Sq      F   Pr(>F)    
#> 1    418 89000                                 
#> 2    416 85700  2    3300.3 8.0101 0.000386 ***
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The output reveals that the F-statistic for this joint hypothesis test is about 
8.01 and the corresponding p-value is 0.0004. Thus, we can reject the null 
hypothesis that both coefficients are zero at any level of significance commonly 
used in practice. 
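

The homoskedasticity-only F-statistic can also be computed by hand from the 
sums of squared residuals of the restricted and the unrestricted regression. 
The sketch below uses simulated data and hypothetical variable names (x1, x2, 
x3, y) rather than CASchools, and checks the manual result against anova(), 
which compares nested lm fits. 

```r
set.seed(7)
n  <- 420
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.3 * x1 - 0.5 * x2 + 0.4 * x3 + rnorm(n)

unrestricted <- lm(y ~ x1 + x2 + x3)
restricted   <- lm(y ~ x2)   # imposes beta_1 = 0 and beta_3 = 0

ssr_u <- sum(residuals(unrestricted)^2)
ssr_r <- sum(residuals(restricted)^2)
q <- 2   # number of restrictions
k <- 3   # number of regressors in the unrestricted model

F_manual <- ((ssr_r - ssr_u) / q) / (ssr_u / (n - k - 1))
p_manual <- pf(F_manual, df1 = q, df2 = n - k - 1, lower.tail = FALSE)

# anova() on the nested models reports the same statistic
F_anova <- anova(restricted, unrestricted)$F[2]
c(manual = F_manual, anova = F_anova, p_value = p_manual)
```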


A heteroskedasticity-robust version of this F-test (which leads to the same con- 
clusion) can be conducted as follows. 


# heteroskedasticity-robust F-test 
linearHypothesis(model, c("size=0", "expenditure=0"), white.adjust = "hc1") 
#> Linear hypothesis test 

#> 

#> Hypothesis: 

#> size = 0 

#> expenditure = 0 

#> 

#> Model 1: restricted model 

#> Model 2: score ~ size + english + expenditure 
#> 

#> Note: Coefficient covariance matrix supplied. 
#> 

#>   Res.Df Df      F   Pr(>F)   
#> 1    418                      
#> 2    416  2 5.4337 0.004682 **
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The standard output of a model summary also reports an F-statistic and the 
corresponding p-value. The null hypothesis belonging to this F-test is that all 
of the population coefficients in the model except for the intercept are zero, so 
the hypotheses are 


$$H_0: \beta_1 = 0, \ \beta_2 = 0, \ \beta_3 = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0 \ \text{for at least one} \ j = 1, 2, 3.$$


This is also called the overall regression F-statistic and the null hypothesis is 
obviously different from testing if only $\beta_1$ and $\beta_3$ are zero. 


We now check whether the F-statistic belonging to the p-value listed in the 
model’s summary coincides with the result reported by linearHypothesis(). 


# execute the function on the model object and provide the restrictions 
# to be tested as a character vector 

linearHypothesis(model, c("size=0", "english=0", "expenditure=0")) 
#> Linear hypothesis test 

#> 

#> Hypothesis: 

#> size = 0 

#> english = 0 

#> expenditure = 0 

#> 

#> Model 1: restricted model 

#> Model 2: score ~ size + english + expenditure 


#> 

#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
#> 1    419 152110                                  
#> 2    416  85700  3     66410 107.45 < 2.2e-16 ***
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# access the overall F-statistic from the model's summary 
summary(model)$fstatistic 
#>    value    numdf    dendf 
#> 107.4547   3.0000 416.0000 


The entry value is the overall F-statistic and it equals the result of 
linearHypothesis(). The F-test rejects the null hypothesis that the model 
has no power in explaining test scores. It is important to know that the 
F-statistic reported by summary() is not robust to heteroskedasticity! 
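

Because the restricted model of the overall F-test contains only the 
intercept, the homoskedasticity-only statistic can be written in terms of the 
regression R-squared. The following sketch checks this relation on simulated 
data; the variable names and the data-generating process are assumptions made 
for illustration only. 

```r
set.seed(11)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 0.5 + x1 - 0.7 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

s  <- summary(fit)
k  <- 2              # number of regressors (excluding the intercept)
r2 <- s$r.squared

# overall F-statistic expressed through R-squared
F_from_r2 <- (r2 / k) / ((1 - r2) / (n - k - 1))

c(from_r2 = F_from_r2, from_summary = unname(s$fstatistic["value"]))
```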




7.4 Confidence Sets for Multiple Coefficients 


Based on the F-statistic that we have previously encountered, we can specify 
confidence sets. Confidence sets are analogous to confidence intervals for single 
coefficients. As such, confidence sets consist of combinations of coefficients that 
contain the true combination of coefficients in, say, 95% of all cases if we could 
repeatedly draw random samples, just like in the univariate case. Put differently, 
a confidence set is the set of all coefficient combinations for which we cannot 
reject the corresponding joint null hypothesis tested using an F-test. 


The confidence set for two coefficients is an ellipse which is centered around 
the point defined by both coefficient estimates. Again, there is a very 
convenient way to plot the confidence set for two coefficients of model objects, 
namely the function confidenceEllipse() from the car package. 


We now plot the 95% confidence ellipse for the coefficients on size and 
expenditure from the regression conducted above. By specifying the 
additional argument fill, the confidence set is colored. 


# draw the 95% confidence set for coefficients on size and expenditure 
confidenceEllipse(model, 
                  fill = TRUE, 
                  lwd = 0, 
                  which.coef = c("size", "expenditure"), 
                  main = "95% Confidence Set") 


[Figure: 95% Confidence Set] 

We see that the ellipse is centered around (-0.29, 3.87), the pair of coefficient 
estimates on size and expenditure. What is more, (0, 0) is not an element of the 
95% confidence set so that we can reject $H_0: \beta_1 = 0, \ \beta_3 = 0$. 

By default, confidenceEllipse() uses homoskedasticity-only standard errors. 
The following code chunk shows how to compute a robust confidence ellipse and 
how to overlay it with the previous plot. 


# draw the robust 95% confidence set for coefficients on size and expenditure 
confidenceEllipse(model, 
                  fill = TRUE, 
                  lwd = 0, 
                  which.coef = c("size", "expenditure"), 
                  main = "95% Confidence Sets", 
                  vcov. = vcovHC(model, type = "HC1"), 
                  col = "red") 


# draw the 95% confidence set for coefficients on size and expenditure 
confidenceEllipse(model, 
                  fill = TRUE, 
                  lwd = 0, 
                  which.coef = c("size", "expenditure"), 
                  add = TRUE) 
[Figure: 95% Confidence Sets] 

As the robust standard errors are slightly larger than those valid under 
homoskedasticity only in this case, the robust confidence set is slightly larger. 
This is analogous to the confidence intervals for the individual coefficients. 
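

Whether a particular pair of values belongs to a confidence set can also be 
checked numerically: a point lies inside the set if the F-statistic for the 
joint hypothesis that the coefficients equal this point does not exceed the 
critical value. The sketch below uses simulated data, and in_conf_set() is a 
hypothetical helper written for this illustration, not a function from car. 

```r
set.seed(3)
n  <- 420
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 - 0.3 * x1 - 0.6 * x2 + 0.5 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

# hypothetical helper: is the point b0 inside the confidence set for 'which'?
in_conf_set <- function(model, which, b0, level = 0.95, V = vcov(model)) {
  d <- coef(model)[which] - b0
  q <- length(which)
  F_stat <- drop(t(d) %*% solve(V[which, which]) %*% d) / q
  F_stat <= qf(level, df1 = q, df2 = df.residual(model))
}

# the point estimate always lies inside the set
inside_estimate <- in_conf_set(fit, c("x1", "x3"), coef(fit)[c("x1", "x3")])

# the origin lies outside the set in this simulation
inside_origin <- in_conf_set(fit, c("x1", "x3"), c(0, 0))

c(estimate = inside_estimate, origin = inside_origin)
```

Passing a heteroskedasticity-robust covariance matrix as V gives the robust 
counterpart of the same check. 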


7.5 Model Specification for Multiple Regression 


Choosing a regression specification, i.e., selecting the variables to be included 
in a regression model, is a difficult task. However, there are some guidelines on 
how to proceed. The goal is clear: obtaining an unbiased and precise estimate of 
the causal effect of interest. As a starting point, think about omitted variables, 
that is, avoid possible bias by using suitable control variables. Omitted 
variable bias in the context of multiple regression is explained in Key Concept 
7.3. A second step could be to compare different specifications by measures of 
fit. However, as we shall see, one should not rely solely on $\bar{R}^2$. 


Key Concept 7.3 
Omitted Variable Bias in Multiple Regression 


Omitted variable bias is the bias in the OLS estimator that arises when 
regressors correlate with an omitted variable. For omitted variable bias 
to arise, two things must be true: 


1. At least one of the included regressors must be correlated with the 
omitted variable. 


2. The omitted variable must be a determinant of the dependent variable, Y. 





We now discuss an example where we face a potential omitted variable bias in a 
multiple regression model: 


Consider again the estimated regression equation 


$$\widehat{TestScore} = \underset{(8.7)}{686.0} - \underset{(0.43)}{1.10} \times size - \underset{(0.031)}{0.650} \times english.$$


We are interested in estimating the causal effect of class size on test score. 
There might be a bias due to omitting “outside learning opportunities” from 
our regression since such a measure could be a determinant of the students’ 
test scores and could also be correlated with both regressors already included in 
the model (so that both conditions of Key Concept 7.3 are fulfilled). “Outside 
learning opportunities” are a complicated concept that is difficult to quantify. A 
surrogate we can consider instead is the students’ economic background, which 
is likely strongly related to outside learning opportunities: think of wealthy 
parents that are able to provide time and/or money for private tuition of their 
children. We thus augment the model with the variable lunch, the percentage 
of students that qualify for a free or subsidized lunch in school due to family 
incomes below a certain threshold, and reestimate the model. 


# estimate the model and print the summary to console 

model <- lm(score ~ size + english + lunch, data = CASchools) 
coeftest(model, vcov. = vcovHC, type = "HC1") 
#> 
#> t test of coefficients: 
#> 
#>               Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept) 700.149957   5.568453 125.7351 < 2.2e-16 ***
#> size         -0.998309   0.270080  -3.6963 0.0002480 ***
#> english      -0.121573   0.032832  -3.7029 0.0002418 ***
#> lunch        -0.547345   0.024107 -22.7046 < 2.2e-16 ***
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Thus, the estimated regression line is 


$$\widehat{TestScore} = \underset{(5.56)}{700.15} - \underset{(0.27)}{1.00} \times size - \underset{(0.03)}{0.12} \times english - \underset{(0.02)}{0.55} \times lunch.$$


We observe no substantial changes in the conclusion about the effect of size on 
TestScore: the coefficient on size changes by only 0.1 and retains its signifi- 
cance. 


Although the difference in estimated coefficients is not big in this case, it is 
useful to keep lunch to make the assumption of conditional mean independence 
more credible (see Chapter 7.5 of the book). 
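

The two conditions of Key Concept 7.3 can be illustrated with simulated data. 
In the sketch below, which rests entirely on assumed values, w is a 
determinant of y and is correlated with the regressor x, and the true effect 
of x is set to -1. Omitting w biases the estimated coefficient on x, while 
including it as a control brings the estimate close to the true value. 

```r
set.seed(2024)
n <- 5000
w <- rnorm(n)                       # omitted determinant of y
x <- -0.5 * w + rnorm(n)            # regressor correlated with w
y <- 10 - 1 * x + 2 * w + rnorm(n)  # true effect of x is -1

b_short <- coef(lm(y ~ x))["x"]      # w omitted: biased away from -1
b_long  <- coef(lm(y ~ x + w))["x"]  # w included: close to -1

c(omitted = unname(b_short), controlled = unname(b_long))
```

Under these assumptions the short regression converges to 
-1 + 2 * cov(x, w) / var(x) = -1.8 rather than to -1. 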


Model Specification in Theory and in Practice 


Key Concept 7.4 lists some common pitfalls when using $R^2$ and $\bar{R}^2$ to 
evaluate the predictive ability of regression models. 

Key Concept 7.4 
$R^2$ and $\bar{R}^2$: What They Tell You and What They Do Not 


The $R^2$ and $\bar{R}^2$ tell you whether the regressors are good at explaining 
the variation of the dependent variable in the sample. If the $R^2$ (or 
$\bar{R}^2$) is nearly 1, then the regressors produce a good prediction of the 
dependent variable in that sample, in the sense that the variance of OLS 
residuals is small compared to the variance of the dependent variable. 
If the $R^2$ (or $\bar{R}^2$) is nearly 0, the opposite is true. 


The $R^2$ and $\bar{R}^2$ do not tell you whether: 


1. An included variable is statistically significant. 


2. The regressors are the true cause of the movements in the dependent 
variable. 


3. There is omitted variable bias. 


4. You have chosen the most appropriate set of regressors. 





For example, think of regressing TestScore on PLS which measures the available 
parking lot space in thousand square feet. You are likely to observe a 
significant coefficient of reasonable magnitude and moderate to high values for 
$R^2$ and $\bar{R}^2$. The reason for this is that parking lot space is 
correlated with many determinants of the test score like location, class size, 
financial endowment and so on. Although we do not have observations on PLS, we 
can use R to generate some relatively realistic data. 


# set seed for reproducibility 
set.seed(1) 


# generate observations for parking lot space 
CASchools$PLS <- c(22 * CASchools$income 
- 15 * CASchools$size 
+ 0.2 * CASchools$expenditure 
+ rnorm(nrow(CASchools), sd = 80) + 3000) 


# plot parking lot space against test score 
plot(CASchools$PLS, 
     CASchools$score, 
     xlab = "Parking Lot Space", 
     ylab = "Test Score", 
     pch = 20, 
     col = "steelblue") 


# regress test score on PLS 
summary(lm(score ~ PLS, data = CASchools)) 


#> 

#> Call: 

#> lm(formula = score ~ PLS, data = CASchools) 

#> 

#> Residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -42.608 -11.049   0.342  12.558  37.105 
#> 
#> Coefficients: 
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 4.897e+02  1.227e+01   39.90   <2e-16 ***
#> PLS         4.002e-02  2.981e-03   13.43   <2e-16 ***
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 15.95 on 418 degrees of freedom 
#> Multiple R-squared:  0.3013, Adjusted R-squared:  0.2996 
#> F-statistic: 180.2 on 1 and 418 DF,  p-value: < 2.2e-16 


PLS is generated as a linear function of expenditure, income, size and a random 
disturbance. Therefore the data suggest that there is some positive relationship 
between parking lot space and test score. In fact, when estimating the model 


$$TestScore = \beta_0 + \beta_1 \times PLS + u \tag{7.1}$$


using lm() we find that the coefficient on PLS is positive and significantly 
different from zero. Also $R^2$ and $\bar{R}^2$ are about 0.3 which is a lot 
more than the roughly 0.05 observed when regressing the test scores on the class 
sizes only. Taken at face value, this suggests that increasing the parking lot 
space boosts a school’s test scores and that model (7.1) does even better in 
explaining heterogeneity in the dependent variable than a model with size as the 
only regressor. Keeping in mind how PLS is constructed this comes as no 
surprise. It is evident that the high $R^2$ cannot be used to conclude that the 
estimated relation between parking lot space and test scores is causal: the 
(relatively) high $R^2$ is due to correlation between PLS and other determinants 
and/or control variables. Increasing parking lot space is not an appropriate 
measure to generate more learning success! 
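

As a quick plausibility check of the two fit measures in the summary output of 
the PLS regression, the adjusted $R^2$ can be recovered from $R^2$ using the 
standard formula. The sample size $n = 420$ is inferred from the 418 residual 
degrees of freedom with $k = 1$ regressor; the arithmetic is an illustration 
added here, not part of the original analysis. 

```latex
\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\,\bigl(1 - R^2\bigr)
          = 1 - \frac{419}{418}\,(1 - 0.3013)
          \approx 0.2996
```

With a single regressor the factor $(n-1)/(n-k-1)$ is close to one, which is 
why both measures nearly coincide here. 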


7.6 Analysis of the Test Score Data Set 


Chapter 6 and some of the previous sections have stressed that it is important 
to include control variables in regression models if it is plausible that there are 
omitted factors. In our example of test scores we want to estimate the causal 
effect of a change in the student-teacher ratio on test scores. We now provide 
an example how to use multiple regression in order to alleviate omitted variable 
bias and demonstrate how to report results using R. 


So far we have considered two variables that control for unobservable student 
characteristics which correlate with the student-teacher ratio and are assumed 
to have an impact on test scores: 


• english, the percentage of English learning students 


• lunch, the share of students that qualify for a subsidized or even a free 
lunch at school 


Another new variable provided with CASchools is calworks, the percentage of 
students that qualify for the CalWorks income assistance program. Students 
eligible for CalWorks live in families with a total income below the threshold 
for the subsidized lunch program so both variables are indicators for the share 
of economically disadvantaged children. Both indicators are highly correlated: 


# estimate the correlation between 'calworks' and 'lunch' 
cor (CASchools$calworks, CASchools$lunch) 
#> [1] 0.7394218 


There is no unambiguous way to proceed when deciding which variable to use. 
In any case it may not be a good idea to use both variables as regressors in view 
of collinearity. Therefore, we also consider alternative model specifications. 


For a start, we plot student characteristics against test scores. 

# set up arrangement of plots 
m <- rbind(c(1, 2), c(3, 0)) 
graphics::layout(mat = m) 


# scatterplots 
plot(score ~ english, 
     data = CASchools, 
     col = "steelblue", 
     pch = 20, 
     xlim = c(0, 100), 
     cex.main = 0.9, 
     main = "Percentage of English language learners") 


plot(score ~ lunch, 
     data = CASchools, 
     col = "steelblue", 
     pch = 20, 
     cex.main = 0.9, 
     main = "Percentage qualifying for reduced price lunch") 


plot(score ~ calworks, 
     data = CASchools, 
     col = "steelblue", 
     pch = 20, 
     xlim = c(0, 100), 
     cex.main = 0.9, 
     main = "Percentage qualifying for income assistance") 


[Figure: scatterplots of test scores against the percentage of English language 
learners, the percentage qualifying for reduced price lunch and the percentage 
qualifying for income assistance] 

We divide the plotting area up using layout(). The matrix m specifies the 
location of the plots, see ?layout. 

We see that all relationships are negative. Here are the correlation coefficients. 


# estimate correlation between student characteristics and test scores 
cor(CASchools$score, CASchools$english) 
#> [1] -0.6441238 
cor(CASchools$score, CASchools$lunch) 
#> [1] -0.868772 
cor(CASchools$score, CASchools$calworks) 
#> [1] -0.6268533 


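We shall consider five different model equations: 


$$\begin{aligned}
(I)   \quad & TestScore = \beta_0 + \beta_1 \times size + u, \\
(II)  \quad & TestScore = \beta_0 + \beta_1 \times size + \beta_2 \times english + u, \\
(III) \quad & TestScore = \beta_0 + \beta_1 \times size + \beta_2 \times english + \beta_3 \times lunch + u, \\
(IV)  \quad & TestScore = \beta_0 + \beta_1 \times size + \beta_2 \times english + \beta_4 \times calworks + u, \\
(V)   \quad & TestScore = \beta_0 + \beta_1 \times size + \beta_2 \times english + \beta_3 \times lunch + \beta_4 \times calworks + u
\end{aligned}$$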


The best way to communicate regression results is in a table. The stargazer 
package is very convenient for this purpose. It provides a function that generates 
professional-looking HTML and LaTeX tables that satisfy scientific standards. 
One simply has to provide one or multiple object(s) of class lm. The rest is done 
by the function stargazer(). 


# load the stargazer library 
library(stargazer) 


# estimate different model specifications 
spec1 <- lm(score ~ size, data = CASchools) 
spec2 <- lm(score ~ size + english, data = CASchools) 
spec3 <- lm(score ~ size + english + lunch, data = CASchools) 
spec4 <- lm(score ~ size + english + calworks, data = CASchools) 
spec5 <- lm(score ~ size + english + lunch + calworks, data = CASchools) 

# gather robust standard errors in a list 
rob_se <- list(sqrt(diag(vcovHC(spec1, type = "HC1"))), 
               sqrt(diag(vcovHC(spec2, type = "HC1"))), 
               sqrt(diag(vcovHC(spec3, type = "HC1"))), 
               sqrt(diag(vcovHC(spec4, type = "HC1"))), 
               sqrt(diag(vcovHC(spec5, type = "HC1")))) 


# generate a LaTeX table using stargazer 
stargazer(spec1, spec2, spec3, spec4, spec5, 
          se = rob_se, 
          digits = 3, 
          header = F, 
          column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)")) 


Table 7.1 states that score is the dependent variable and that we consider five 
models. We see that the columns of Table 7.1 contain most of the information 
provided by coeftest() and summary() for the regression models under 
consideration: the coefficient estimates equipped with significance codes (the 
asterisks) and standard errors in parentheses below. Although there are no 
t-statistics, it is straightforward for the reader to compute them simply by 
dividing a coefficient estimate by the corresponding standard error. The bottom 
of the table reports summary statistics for each model and a legend. For an 
in-depth discussion of the tabular presentation of regression results, see 
Chapter 7.6 of the book. 


What can we conclude from the model comparison? 


1. We see that adding control variables roughly halves the coefficient on size. 
Also, the estimate is not sensitive to the set of control variables used. The 
conclusion is that decreasing the student-teacher ratio ceteris paribus by 
one unit leads to an estimated average increase in test scores of about 1 
point. 


2. Adding student characteristics as controls increases $R^2$ and $\bar{R}^2$ 
from 0.049 (spec1) up to 0.773 (spec3 and spec5), so we can consider these 
variables as suitable predictors for test scores. Moreover, the estimated 
coefficients on all control variables are consistent with the impressions 
gained from Figure 7.2 of the book. 


3. We see that the control variables are not statistically significant in all 
models. For example in spec5, the coefficient on calworks is not significantly 
different from zero at the 5% level since |-0.048/0.059| = 0.81 < 1.96. We 
also observe that the effect on the estimate (and its standard error) of the 
coefficient on size of adding calworks to the base specification spec3 is 
negligible. We can therefore consider calworks as a superfluous control 
variable, given the inclusion of lunch in this model. 


7.7 Exercises 


This interactive part of the book is only available in the HTML version. 

Chapter 8 


Nonlinear Regression Functions 


Until now we assumed the regression function to be linear, i.e., we have treated 
the slope parameter of the regression function as a constant. This implies that 
the effect on Y of a one unit change in X does not depend on the level of X. 
If, however, the effect of a change in X on Y does depend on the value of X, 
we should use a nonlinear regression function. 


Just like for the previous chapter, the packages AER (Kleiber and Zeileis, 2020) 
and stargazer (Hlavac, 2018) are required for reproduction of the code pre- 
sented in this chapter. Check whether the code chunk below executes without 
any error messages. 


library (AER) 
library (stargazer) 


8.1 A General Strategy for Modelling Nonlinear 
Regression Functions 


Let us have a look at an example where using a nonlinear regression function is 
better suited for estimating the population relationship between the regressor, 
X, and the regressand, Y: the relationship between the income of schooling 
districts and their test scores. 


# prepare the data 


library(AER) 
data(CASchools) 
CASchools$size <- CASchools$students/CASchools$teachers 
CASchools$score <- (CASchools$read + CASchools$math) / 2 


We start our analysis by computing the correlation between both variables. 


cor(CASchools$income, CASchools$score) 
#> [1] 0.7124308 


Here, income and test scores are positively related: school districts with above 
average income tend to achieve above average test scores. Does a linear regres- 
sion function model the data adequately? Let us plot the data and add a linear 
regression line. 


# fit a simple linear model 
linear_model <- lm(score ~ income, data = CASchools) 


# plot the observations 
plot(CASchools$income, CASchools$score, 
     col = "steelblue", 
     pch = 20, 
     xlab = "District Income (thousands of dollars)", 
     ylab = "Test Score", 
     cex.main = 0.9, 
     main = "Test Score vs. District Income and a Linear OLS Regression Function") 


# add the regression line to the plot 
abline(linear_model, 

col = "red", 

lwd = 2) 


[Figure: Test Score vs. District Income and a Linear OLS Regression Function; 
x-axis: District Income (thousands of dollars), y-axis: Test Score] 




As pointed out in the book, the linear regression line seems to overestimate the 
true relationship when income is very high or very low and underestimates it 
for the middle income group. 


Fortunately, OLS does not only handle linear functions of the regressors. We 
can for example model test scores as a function of income and the square of 
income. The corresponding regression model is 


$$TestScore_i = \beta_0 + \beta_1 \times income_i + \beta_2 \times income_i^2 + u_i,$$


called a quadratic regression model. That is, $income^2$ is treated as an 
additional explanatory variable. Hence, the quadratic model is a special case of 
a multivariate regression model. When fitting the model with lm() we have to 
use the ^ operator in conjunction with the function I() to add the quadratic 
term as an additional regressor to the argument formula. This is because the 
regression formula we pass to formula is converted to an object of the class 
formula. For objects of this class, the operators +, -, * and ^ have a 
nonarithmetic interpretation. I() ensures that they are used as arithmetical 
operators, see ?I. 


# fit the quadratic Model 
quadratic_model <- lm(score ~ income + I(income^2), data = CASchools) 


# obtain the model summary 
coeftest(quadratic_model, vcov. = vcovHC, type = "HC1") 
#> 
#> t test of coefficients: 
#> 
#>                Estimate  Std. Error  t value  Pr(>|t|)    
#> (Intercept) 607.3017435   2.9017544 209.2878 < 2.2e-16 ***
#> income        3.8509939   0.2680942  14.3643 < 2.2e-16 ***
#> I(income^2)  -0.0423084   0.0047803  -8.8505 < 2.2e-16 ***
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The output tells us that the estimated regression function is 


$$\widehat{TestScore}_i = \underset{(2.90)}{607.3} + \underset{(0.27)}{3.85} \times income_i - \underset{(0.0048)}{0.0423} \times income_i^2.$$


This model allows us to test the hypothesis that the relationship between test 
scores and district income is linear against the alternative that it is quadratic. 
This corresponds to testing 


$$H_0: \beta_2 = 0 \quad \text{vs.} \quad H_1: \beta_2 \neq 0,$$


since $\beta_2 = 0$ corresponds to a simple linear equation and $\beta_2 \neq 0$ 
implies a quadratic relationship. We find that 
$t = (\hat\beta_2 - 0)/SE(\hat\beta_2) = -0.0423/0.0048 = -8.81$ so the null is 
rejected at any common level of significance and we conclude that the 
relationship is nonlinear. This is consistent with the impression gained 
from the plot. 
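

The role of I() in the model formula can be checked directly. The sketch below 
uses simulated data with hypothetical names x and y (not CASchools) and 
compares a formula with and without I(): without it, ^ is interpreted as a 
formula operator and no squared regressor enters the model. 

```r
set.seed(5)
x <- runif(200, 0, 10)
y <- 1 + 2 * x - 0.1 * x^2 + rnorm(200)

# without I(), '^' is a formula operator and no squared term is added
fit_no_I <- lm(y ~ x + x^2)
names(coef(fit_no_I))   # coefficient names: (Intercept), x

# with I(), x^2 is computed arithmetically and enters as a regressor
fit_I <- lm(y ~ x + I(x^2))
names(coef(fit_I))      # coefficient names: (Intercept), x, I(x^2)
```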


We now draw the same scatter plot as for the linear model and add the regression 
line for the quadratic model. Because abline() can only draw straight 
lines, it cannot be used here. lines() is a function which allows us to draw 
nonstraight lines, see ?lines. The most basic call of lines() is 
lines(x_values, y_values) where x_values and y_values are vectors of the same 
length that provide coordinates of the points to be sequentially connected by a 
line. This makes it necessary to sort the coordinate pairs according to the 
X-values. Here we use the function order() to sort the fitted values of score 
according to the observations of income. 


# draw a scatterplot of the observations for income and test score 
plot(CASchools$income, CASchools$score, 
     col = "steelblue", 
     pch = 20, 
     xlab = "District Income (thousands of dollars)", 
     ylab = "Test Score", 
     main = "Estimated Linear and Quadratic Regression Functions") 


# add a linear function to the plot 
abline(linear_model, col = "black", lwd = 2) 


# add quadratic function to the plot 
order_id <- order(CASchools$income) 


lines(x = CASchools$income[order_id], 
y = fitted(quadratic_model) [order_id], 
col = "red", 
lwd = 2) 




[Figure: Estimated Linear and Quadratic Regression Functions] 

We see that the quadratic function does fit the data much better than the linear 
function. 


8.2 Nonlinear Functions of a Single Independent 
Variable 


Polynomials 


The approach used to obtain a quadratic model can be generalized to polynomial 
models of arbitrary degree r, 


$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_r X_i^r + u_i.$$


A cubic model for instance can be estimated in the same way as the quadratic 
model; we just have to use a polynomial of degree r = 3 in income. This is 
conveniently done using the function poly(). 


# estimate a cubic model 
cubic_model <- lm(score ~ poly(income, degree = 3, raw = TRUE), data = CASchools) 


poly() generates orthogonal polynomials which are orthogonal to the constant 
by default. Here, we set raw = TRUE such that raw polynomials are evaluated, 
see ?poly. 
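

The difference between raw and orthogonal polynomials can be seen in a short 
experiment. The sketch below relies on simulated data with hypothetical names 
x and y. It checks that raw = TRUE reproduces the coefficients of a 
specification written with I(), and that switching to orthogonal polynomials 
changes the coefficients but leaves the fitted values unchanged. 

```r
set.seed(9)
x <- runif(300, 0, 10)
y <- 5 + 2 * x - 0.15 * x^2 + 0.003 * x^3 + rnorm(300)

fit_raw  <- lm(y ~ poly(x, degree = 3, raw = TRUE))
fit_orth <- lm(y ~ poly(x, degree = 3))
fit_I    <- lm(y ~ x + I(x^2) + I(x^3))

# raw polynomials reproduce the coefficients of the I() specification
same_coefs <- all.equal(unname(coef(fit_raw)), unname(coef(fit_I)))

# orthogonal polynomials yield the same fitted values as raw polynomials
same_fit <- all.equal(unname(fitted(fit_raw)), unname(fitted(fit_orth)))

c(same_coefs = isTRUE(same_coefs), same_fit = isTRUE(same_fit))
```

Raw polynomials keep the coefficients interpretable as those of the polynomial 
regression model above, which is why raw = TRUE is used in this chapter. 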


In practice the question will arise which polynomial order should be chosen. 
First, similarly as for r = 2, we can test the null hypothesis that the true 
relation is linear against the alternative hypothesis that the relationship is a 
polynomial of degree r: 

$$H_0: \beta_2 = 0, \ \beta_3 = 0, \dots, \beta_r = 0 \quad \text{vs.} \quad H_1: \text{at least one } \beta_j \neq 0, \ j = 2, \dots, r.$$


This is a joint null hypothesis with r - 1 restrictions so it can be tested using 
the F-test presented in Chapter 7. linearHypothesis() can be used to conduct 
such tests. For example, we may test the null of a linear model against the 
alternative of a polynomial of a maximal degree r = 3 as follows. 


# test the hypothesis of a linear model against quadratic or polynomial 
# alternatives 


# set up hypothesis matrix 
R <- rbind(c(0, 0, 1, 0), 
           c(0, 0, 0, 1)) 


# do the test 

linearHypothesis(cubic_model, 
hypothesis.matrix = R, 
white.adj = "hc1") 

#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> poly(income, degree = 3, raw = TRUE)2 = 0 
#> poly(income, degree = 3, raw = TRUE)3 = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ poly(income, degree = 3, raw = TRUE) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F    Pr(>F) 
#> 1    418 
#> 2    416  2 37.691 9.043e-16 *** 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


We provide a hypothesis matrix as the argument hypothesis.matrix. This is 
useful when the coefficients have long names, as is the case here due to using 
poly(), or when the restrictions include multiple coefficients. How the 
hypothesis matrix R is interpreted by linearHypothesis() is best seen using 
matrix algebra. The restrictions are written as 

$$R \beta = s.$$

For the two linear constraints above, we have 

$$\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$


linearHypothesis() uses the zero vector for s by default, see ?linearHypothesis. 
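
To make the role of the hypothesis matrix concrete, the F-statistic can be
computed by hand from R, s, the coefficient vector and its covariance matrix.
This is a minimal sketch on simulated data (x, y and the parameter values are
assumptions); it uses the homoskedasticity-only covariance matrix from vcov(),
not the robust version used in the chapter, so that the result can be
cross-checked against the classical F-test from anova().

```r
# simulated data (hypothetical; not the CASchools data set)
set.seed(42)
n <- 200
x <- runif(n, 5, 55)
y <- 600 + 4 * x - 0.04 * x^2 + rnorm(n, sd = 10)

fit <- lm(y ~ poly(x, degree = 3, raw = TRUE))

# hypothesis matrix: one row per restriction, one column per coefficient
R <- rbind(c(0, 0, 1, 0),
           c(0, 0, 0, 1))
s <- c(0, 0)

b <- coef(fit)
V <- vcov(fit)   # homoskedasticity-only covariance matrix
q <- nrow(R)     # number of restrictions

# F-statistic: (Rb - s)' [R V R']^{-1} (Rb - s) / q
d <- R %*% b - s
F_stat <- drop(t(d) %*% solve(R %*% V %*% t(R)) %*% d) / q
p_val <- pf(F_stat, df1 = q, df2 = df.residual(fit), lower.tail = FALSE)

# cross-check against the classical F-test of the nested linear model
F_anova <- anova(lm(y ~ x), fit)$F[2]
c(F_stat = F_stat, F_anova = F_anova, p_value = p_val)
```

Replacing V by a heteroskedasticity-robust covariance matrix gives the robust
version of the statistic.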


The p-value of the test is very small so that we reject the null hypothesis. 
However, this does not tell us which r to choose. In practice, one approach to 
determine the degree of the polynomial is to use sequential testing: 


1. Estimate a polynomial model for some maximum value r. 

2. Use a t-test to test $\beta_r = 0$. Rejection of the null means that $X^r$ 
   belongs in the regression equation. 

3. Failure to reject the null in step 2 means that $X^r$ can be eliminated from 
   the model. Continue by repeating step 1 with order r - 1 and test whether 
   $\beta_{r-1} = 0$. If the test rejects, use a polynomial model of order r - 1. 

4. If the test from step 3 does not reject, continue with the procedure until 
   the coefficient on the highest power is statistically significant. 
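
The steps above can be sketched as a short loop. This is a hedged illustration
only: sequential_poly() is a hypothetical helper written for this example, the
data are simulated, and the t-tests use the homoskedasticity-only standard
errors reported by summary() to keep the sketch free of package dependencies.

```r
# hypothetical helper, not part of any package
sequential_poly <- function(y, x, r_max = 4, alpha = 0.05) {
  for (r in r_max:1) {
    fit <- lm(y ~ poly(x, degree = r, raw = TRUE))
    coef_table <- summary(fit)$coefficients
    # p-value of the t-test on the coefficient of the highest power
    p_highest <- coef_table[r + 1, "Pr(>|t|)"]
    # stop as soon as the coefficient on the highest power is significant
    if (p_highest < alpha) return(list(degree = r, model = fit))
  }
  # no power of x is significant: fall back to the intercept-only model
  list(degree = 0, model = lm(y ~ 1))
}

# simulated data with a quadratic population regression function
set.seed(1)
x <- runif(500, 5, 55)
y <- 600 + 4 * x - 0.04 * x^2 + rnorm(500, sd = 5)

result <- sequential_poly(y, x, r_max = 4)
result$degree
```

Because each step is a test at level alpha, the procedure can occasionally
select a degree that differs from the one used to generate the data.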


There is no unambiguous guideline on how to choose r in step 1. However, as 
pointed out in Stock and Watson (2015), economic data is often smooth such 
that it is appropriate to choose small orders like 2, 3, or 4. 


We will demonstrate how to apply sequential testing by the example of the cubic 
model. 


summary(cubic_model) 
#> 
#> Call: 
#> lm(formula = score ~ poly(income, degree = 3, raw = TRUE), data = CASchools) 
#> 
#> Residuals: 
#>    Min     1Q Median     3Q    Max 
#>    ...    ...   0.20   8.32  31.16 
#> 
#> Coefficients: 
#>                                         Estimate Std. Error t value Pr(>|t|) 
#> (Intercept)                            6.001e+02  5.830e+00 102.937  < 2e-16 *** 
#> poly(income, degree = 3, raw = TRUE)1  5.019e+00  8.595e-01   5.839 1.06e-08 *** 
#> poly(income, degree = 3, raw = TRUE)2 -9.581e-02  3.736e-02  -2.564   0.0107 * 
#> poly(income, degree = 3, raw = TRUE)3  6.855e-04  4.720e-04   1.452   0.1471 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
#> 
#> Residual standard error: 12.71 on 416 degrees of freedom 
#> Multiple R-squared:  0.5584, Adjusted R-squared:  0.5552 
#> F-statistic: 175.4 on 3 and 416 DF,  p-value: < 2.2e-16 


The estimated cubic model stored in cubic_model is 


$$\widehat{TestScore}_i = \underset{(5.83)}{600.1} + \underset{(0.86)}{5.02} \times income_i - \underset{(0.037)}{0.096} \times income_i^2 + \underset{(0.00047)}{0.00069} \times income_i^3.$$


The t-statistic on income^3 is 1.45, so the null that the relationship is 
quadratic cannot be rejected, even at the 10% level. This is contrary to the 
result presented in the book, which reports robust standard errors throughout, 
so we will also use robust variance-covariance estimation to reproduce these 
results. 


# test the hypothesis using robust standard errors 
coeftest(cubic_model, vcov. = vcovHC, type = "HC1") 


#> 
#> t test of coefficients: 
#> 
#>                                          Estimate  Std. Error  t value  Pr(>|t|) 
#> (Intercept)                            6.0008e+02  5.1021e+00 117.6150 < 2.2e-16 *** 
#> poly(income, degree = 3, raw = TRUE)1  5.0187e+00  7.0735e-01   7.0950 5.606e-12 *** 
#> poly(income, degree = 3, raw = TRUE)2 -9.5805e-02  2.8954e-02  -3.3089  0.001018 ** 
#> poly(income, degree = 3, raw = TRUE)3  6.8549e-04  3.4706e-04   1.9751  0.048918 * 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


The reported standard errors have changed. Furthermore, the coefficient for 
income^3 is now significant at the 5% level. This means we reject the hypothesis 
that the regression function is quadratic against the alternative that it is cubic. 
Furthermore, we can also test if the coefficients for income^2 and income^3 are 
jointly significant using a robust version of the F-test. 


# perform robust F-test 
linearHypothesis(cubic_model, 
                 hypothesis.matrix = R, 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> poly(income, degree = 3, raw = TRUE)2 = 0 
#> poly(income, degree = 3, raw = TRUE)3 = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ poly(income, degree = 3, raw = TRUE) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F    Pr(>F) 
#> 1    418 
#> 2    416  2 29.678 8.945e-13 *** 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


With a p-value of 8.945e-13, i.e., much less than 0.05, the null hypothesis of 
linearity is rejected in favor of the alternative that the relationship is quadratic 
or cubic. 
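
The robust tests above rely on a heteroskedasticity-consistent covariance
estimator of type HC1. As a rough sketch of what such an estimator computes,
the sandwich formula can be written out in base R. The data below are simulated
(x, y and the error structure are assumptions), and the code is an illustration
of the formula, not the implementation used by vcovHC().

```r
# simulated data with heteroskedastic errors (hypothetical example)
set.seed(10)
n <- 300
x <- runif(n, 5, 55)
y <- 600 + 2 * x + rnorm(n, sd = 0.5 * x)   # error spread grows with x

fit <- lm(y ~ x)
X <- model.matrix(fit)
u <- residuals(fit)
k <- ncol(X)

# sandwich formula: (X'X)^{-1} X' diag(u^2) X (X'X)^{-1}
bread <- solve(crossprod(X))
meat <- crossprod(X * u)   # equals X' diag(u^2) X
V_HC0 <- bread %*% meat %*% bread

# HC1 applies the degrees-of-freedom correction n / (n - k)
V_HC1 <- V_HC0 * n / (n - k)

# compare homoskedasticity-only and robust standard errors
cbind(classical = sqrt(diag(vcov(fit))),
      HC1 = sqrt(diag(V_HC1)))
```

With heteroskedastic errors the two columns generally differ, which is why the
t-statistics change once robust standard errors are used.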


Interpretation of Coefficients in Nonlinear Regression Models 


The coefficients in polynomial regression do not have a simple interpretation. 
Why? Think of a quadratic model: it is not helpful to think of the coefficient 
on $X$ as the expected change in $Y$ associated with a change in $X$ holding the 
other regressors constant because $X^2$ changes as $X$ varies. This is also the 
case for other deviations from linearity, for example in models where regressors 
and/or the dependent variable are log-transformed. A way to approach this is 
to calculate the estimated effect on $Y$ associated with a change in $X$ for one or 
more values of $X$. This idea is summarized in Key Concept 8.1. 
Key Concept 8.1 
The Expected Effect on Y of a Change in X_1 in a Nonlinear 
Regression Model 


Consider the nonlinear population regression model 


$$Y_i = f(X_{1i}, X_{2i}, \dots, X_{ki}) + u_i, \quad i = 1, \dots, n,$$


where $f(X_{1i}, X_{2i}, \dots, X_{ki})$ is the population regression function and $u_i$ 
is the error term. 

Denote by $\Delta Y$ the expected change in $Y$ associated with $\Delta X_1$, the 
change in $X_1$ while holding $X_2, \dots, X_k$ constant. That is, the expected 
change in $Y$ is the difference 


$$\Delta Y = f(X_1 + \Delta X_1, X_2, \dots, X_k) - f(X_1, X_2, \dots, X_k).$$


The estimator of this unknown population difference is the difference 
between the predicted values for these two cases. Let $\hat{f}(X_1, X_2, \dots, X_k)$ 
be the predicted value of $Y$ based on the estimator $\hat{f}$ of the population 
regression function. Then the predicted change in $Y$ is 


$$\Delta \widehat{Y} = \hat{f}(X_1 + \Delta X_1, X_2, \dots, X_k) - \hat{f}(X_1, X_2, \dots, X_k).$$





For example, we may ask the following: what is the predicted change in test 
scores associated with a one unit change (i.e., $1000) in income, based on the 
estimated quadratic regression function 


$$\widehat{TestScore} = 607.3 + 3.85 \times income - 0.0423 \times income^2 \ ?$$


Because the regression function is quadratic, this effect depends on the initial 
district income. We therefore consider two cases: 


1. An increase in district income from 10 to 11 (from $10000 per capita to 
$11000). 


2. An increase in district income from 40 to 41 (that is from $40000 to 
$41000). 


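In order to obtain the $\Delta \widehat{Y}$ associated with a change in income from 10 to 11, 
we use the following formula: 


$$\Delta \widehat{Y} = \left(\hat\beta_0 + \hat\beta_1 \times 11 + \hat\beta_2 \times 11^2\right) - \left(\hat\beta_0 + \hat\beta_1 \times 10 + \hat\beta_2 \times 10^2\right).$$


To compute $\widehat{Y}$ using R we may use predict(). 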
# compute and assign the quadratic model 
quadriatic_model <- lm(score ~ income + I(income^2), data = CASchools) 


# set up data for prediction 
new_data <- data.frame(income = c(10, 11)) 


# do the prediction 
Y_hat <- predict(quadriatic_model, newdata = new_data) 


# compute the difference 
diff (Y_hat) 
#> 2 
#> 2.962517 


Analogously we can compute the effect of a change in district income from 40 
to 41: 


# set up data for prediction 
new_data <- data.frame(income = c(40, 41)) 


# do the prediction 
Y_hat <- predict(quadriatic_model, newdata = new_data) 


# compute the difference 
diff (Y_hat) 

#> 2 

#> 0.4240097 


So for the quadratic model, the expected change in TestScore induced by an 
increase in income from 10 to 11 is about 2.96 points but an increase in income 
from 40 to 41 increases the predicted score by only 0.42. Hence, the slope of the 
estimated quadratic regression function is steeper at low levels of income than 
at higher levels. 
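
The computation of Key Concept 8.1 can be wrapped in a small function so that
it works for any fitted model. The helper predicted_change() below is
hypothetical (it is not part of base R or any package), and the data are
simulated with coefficients chosen close to the estimates reported above, so
the numbers it produces are illustrative only.

```r
# hypothetical helper: predicted change in Y when `variable` moves from
# `from` to `to`, holding the regressors listed in `other` constant
predicted_change <- function(model, variable, from, to, other = list()) {
  base <- c(setNames(list(from), variable), other)
  new  <- c(setNames(list(to), variable), other)
  newdata <- rbind(as.data.frame(base), as.data.frame(new))
  unname(diff(predict(model, newdata = newdata)))
}

# simulated data (not the CASchools data set)
set.seed(5)
sim <- data.frame(income = runif(300, 5, 55))
sim$score <- 607 + 3.85 * sim$income - 0.0423 * sim$income^2 +
  rnorm(300, sd = 10)

quad_fit <- lm(score ~ income + I(income^2), data = sim)

# predicted effect of a one-unit increase at low and at high income
c(low = predicted_change(quad_fit, "income", 10, 11),
  high = predicted_change(quad_fit, "income", 40, 41))
```

For the quadratic model the result equals the estimated coefficient on income
plus the coefficient on income^2 times (to^2 - from^2), which is what the
formula above expresses.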


Logarithms 


Another way to specify a nonlinear regression function is to use the natural log- 
arithm of Y and/or X. Logarithms convert changes in variables into percentage 
changes. This is convenient as many relationships are naturally expressed in 
terms of percentages. 


There are three different cases in which logarithms might be used. 


1. Transform X with its logarithm, but not Y. 
2. Analogously we could transform Y to its logarithm but leave X at level. 


3. Both Y and X are transformed to their logarithms. 


The interpretation of the regression coefficients is different in each case. 


Case I: X is in Logarithm, Y is not. 


The regression model then is 

$$Y_i = \beta_0 + \beta_1 \times \ln(X_i) + u_i, \quad i = 1, \dots, n.$$


As for polynomial regression, we do not have to create a new variable 
before using lm(). We can simply adjust the formula argument of lm() to tell 
R that the log-transformation of a variable should be used. 


# estimate a level-log model 
LinearLog_model <- lm(score ~ log(income), data = CASchools) 


# compute robust summary 
coeftest(LinearLog_model, 
         vcov = vcovHC, type = "HC1") 


#> 
#> t test of coefficients: 
#> 
#>             Estimate Std. Error t value  Pr(>|t|) 
#> (Intercept) 557.8323     3.8399 145.271 < 2.2e-16 *** 
#> log(income)  36.4197     1.3969  26.071 < 2.2e-16 *** 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


Hence, the estimated regression function is 


$$\widehat{TestScore} = 557.8 + 36.42 \times \ln(income).$$


Let us draw a plot of this function. 


# draw a scatterplot 
plot(score ~ income, 
     col = "steelblue", 
     pch = 20, 
     data = CASchools, 
     main = "Linear-Log Regression Line") 

# add the linear-log regression line 
order_id <- order(CASchools$income) 


lines(CASchools$income[order_id], 
      fitted(LinearLog_model)[order_id], 
      col = "red", 
      lwd = 2) 


[Figure: Linear-Log Regression Line] 

We can interpret $\hat\beta_1$ as follows: a 1% increase in income is associated with an 
increase in test scores of 0.01 x 36.42 = 0.36 points. In order to get the estimated 
effect of a one unit change in income (that is, a change in the original units, 
thousands of dollars) on test scores, the method presented in Key Concept 8.1 
can be used. 


# set up new data 
new_data <- data.frame(income = c(10, 11, 40, 41)) 


# predict the outcomes 
Y_hat <- predict(LinearLog_model, newdata = new_data) 


# compute the expected difference 

Y_hat_matrix <- matrix(Y_hat, nrow = 2, byrow = TRUE) 
Y_hat_matrix[, 2] - Y_hat_matrix[, 1] 

#> [1] 3.471166 0.899297 


By setting nrow = 2 and byrow = TRUE in matrix() we ensure that 
Y_hat_matrix is a 2 x 2 matrix filled row-wise with the entries of Y_hat. 


The estimated model states that for an income increase from $10000 to $11000, 
test scores increase by an expected amount of 3.47 points. When income 
increases from $40000 to $41000, the expected increase in test scores is only about 
0.90 points. 
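
These two numbers can be checked by hand. As a worked sketch that uses the
rounded coefficient estimate 36.42 reported above, the predicted change in a
linear-log model depends only on the ratio of the two income levels:

$$\Delta \widehat{Y} = \hat\beta_1 \left[\ln(X + \Delta X) - \ln(X)\right] = \hat\beta_1 \ln\left(\frac{X + \Delta X}{X}\right),$$

$$36.42 \times \ln\left(\frac{11}{10}\right) \approx 36.42 \times 0.0953 \approx 3.47, \qquad 36.42 \times \ln\left(\frac{41}{40}\right) \approx 36.42 \times 0.0247 \approx 0.90.$$

A move from 10 to 11 is a 10% increase in income whereas a move from 40 to 41
is only a 2.5% increase, which is why the second effect is so much smaller.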


Case II: Y is in Logarithm, X is not 


There are cases where it is useful to regress $\ln(Y)$. 


The corresponding regression model then is 


$$\ln(Y_i) = \beta_0 + \beta_1 \times X_i + u_i, \quad i = 1, \dots, n.$$


# estimate a log-linear model 
LogLinear_model <- lm(log(score) ~ income, data = CASchools) 


# obtain a robust coefficient summary 
coeftest(LogLinear_model, 
         vcov = vcovHC, type = "HC1") 


#> 
#> t test of coefficients: 
#> 
#>               Estimate Std. Error  t value  Pr(>|t|) 
#> (Intercept) 6.43936234 0.00289382 2225.210 < 2.2e-16 *** 
#> income      0.00284407 0.00017509   16.244 < 2.2e-16 *** 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


The estimated regression function is 


$$\widehat{\ln(TestScore)} = 6.439 + 0.00284 \times income.$$


An increase in district income by $1000 is expected to increase test scores by 
100 x 0.00284% = 0.284%. 
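
The figure of 0.284% relies on the approximation $\ln(1 + x) \approx x$ for
small $x$. As a worked check, using the rounded estimate 0.00284 and
$\Delta X = 1$, the exact predicted relative change implied by the log-linear
model is

$$\frac{\Delta Y}{Y} = e^{\hat\beta_1 \Delta X} - 1 = e^{0.00284} - 1 \approx 0.002844,$$

i.e., about 0.2844%, so the approximation error is negligible here. For larger
coefficients the approximate and the exact figure can differ noticeably.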


When the dependent variable is in logarithms, one cannot simply use $e^{\log(\cdot)}$ to 
transform predictions back to the original scale, see the corresponding discussion 
in the book. 


Case III: X and Y are in Logarithms 


The log-log regression model is 


$$\ln(Y_i) = \beta_0 + \beta_1 \times \ln(X_i) + u_i, \quad i = 1, \dots, n.$$

# estimate the log-log model 
LogLog_model <- lm(log(score) ~ log(income), data = CASchools) 


# print robust coefficient summary to the console 
coeftest(LogLog_model, 
         vcov = vcovHC, type = "HC1") 


#> 
#> t test of coefficients: 
#> 
#>              Estimate Std. Error  t value  Pr(>|t|) 
#> (Intercept) 6.3363494  0.0059246 1069.501 < 2.2e-16 *** 
#> log(income) 0.0554190  0.0021446   25.841 < 2.2e-16 *** 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


The estimated regression function hence is 

$$\widehat{\ln(TestScore)} = 6.336 + 0.0554 \times \ln(income).$$


In a log-log model, a 1% change in X is associated with an estimated $\hat\beta_1$% 
change in Y. 
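
The elasticity interpretation can be illustrated with a short simulation. The
following sketch uses artificial data generated with an assumed constant
elasticity of 0.05 (x, y and all parameter values are hypothetical) and shows
that the predicted percentage change in y from a 1% increase in x is the same
at every level of x.

```r
# simulated data with a constant elasticity of 0.05 (an assumed value)
set.seed(7)
n <- 500
x <- runif(n, 5, 55)
y <- exp(6.3 + 0.05 * log(x) + rnorm(n, sd = 0.02))

loglog_fit <- lm(log(y) ~ log(x))
elasticity <- unname(coef(loglog_fit)[2])

# predicted percentage change in y when x rises by 1%, starting from x0
pct_change <- function(x0) {
  pred <- predict(loglog_fit, newdata = data.frame(x = c(x0, 1.01 * x0)))
  unname(100 * (exp(diff(pred)) - 1))
}

c(elasticity = elasticity, at_10 = pct_change(10), at_40 = pct_change(40))
```

The two percentage changes coincide and are close to the estimated coefficient
on log(x), in line with the interpretation of that coefficient as an elasticity.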


We now reproduce Figure 8.5 of the book. 


# generate a scatterplot 
plot(log(score) ~ income, 
     col = "steelblue", 
     pch = 20, 
     data = CASchools, 
     main = "Log-Linear Regression Function") 


# add the log-linear regression line 
order_id <- order(CASchools$income) 


lines(CASchools$income[order_id], 
      fitted(LogLinear_model)[order_id], 
      col = "red", 
      lwd = 2) 


# add the log-log regression line 
lines(sort(CASchools$income), 
      fitted(LogLog_model)[order(CASchools$income)], 
      col = "green", 
      lwd = 2) 


# add a legend 
legend("bottomright", 
       legend = c("log-linear model", "log-log model"), 
       lwd = 2, 
       col = c("red", "green")) 


[Figure: Log-Linear Regression Function] 

Key Concept 8.2 summarizes the three logarithmic regression models. 


Key Concept 8.2 
Logarithms in Regression: Three Cases 


Model Specification                              Interpretation of $\beta_1$ 

$Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i$         A 1% change in X is associated with a 
                                                 change in Y of $0.01 \times \beta_1$. 

$\ln(Y_i) = \beta_0 + \beta_1 X_i + u_i$         A change in X by one unit ($\Delta X = 1$) 
                                                 is associated with a $100 \times \beta_1$% 
                                                 change in Y. 

$\ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i$    A 1% change in X is associated with a 
                                                 $\beta_1$% change in Y, so $\beta_1$ is the 
                                                 elasticity of Y with respect to X. 





Of course we can also estimate a polylog model like 


$$TestScore_i = \beta_0 + \beta_1 \times \ln(income_i) + \beta_2 \times \ln(income_i)^2 + \beta_3 \times \ln(income_i)^3 + u_i$$

which models the dependent variable TestScore by a third-degree polynomial 
of the log-transformed regressor income. 


# estimate the polylog model 
polyLog_model <- lm(score ~ log(income) + I(log(income)^2) + I(log(income)^3), 
                    data = CASchools) 


# print robust summary to the console 
coeftest(polyLog_model, 
         vcov = vcovHC, type = "HC1") 


#> 
#> t test of coefficients: 
#> 
#>                  Estimate Std. Error t value  Pr(>|t|) 
#> (Intercept)      486.1341    79.3825  6.1239 2.115e-09 *** 
#> log(income)      113.3820    87.8837  1.2901    0.1977 
#> I(log(income)^2) -26.9111    31.7457 -0.8477    0.3971 
#> I(log(income)^3)   3.0632     3.7369  0.8197    0.4128 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


Comparing by adjusted $R^2$ we find that, leaving out the log-linear model, all 
models have a similar adjusted fit. In the class of polynomial models, the cubic 
specification has the highest adjusted $R^2$ whereas the linear-log specification 
is the best of the log-models. 


# compute the adj. R^2 for the nonlinear models 
adj_R2 <- rbind("quadratic" = summary(quadratic_model)$adj.r.squared, 
                "cubic" = summary(cubic_model)$adj.r.squared, 
                "LinearLog" = summary(LinearLog_model)$adj.r.squared, 
                "LogLinear" = summary(LogLinear_model)$adj.r.squared, 
                "LogLog" = summary(LogLog_model)$adj.r.squared, 
                "polyLog" = summary(polyLog_model)$adj.r.squared) 


# assign column names 
colnames(adj_R2) <- "adj_R2" 


adj_R2 

#>              adj_R2 
#> quadratic 0.5540444 
#> cubic     0.5552279 
#> LinearLog 0.5614605 
#> LogLinear 0.4970106 
#> LogLog    0.5567251 
#> polyLog   0.5599944 
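
The adjusted R^2 values reported by summary() can be reproduced from the
residuals. The following is a minimal sketch with a hypothetical helper
adj_r2() and simulated data; it applies the usual formula
1 - (n - 1) / (n - k - 1) x SSR / TSS, where k is the number of regressors
excluding the intercept.

```r
# hypothetical helper computing the adjusted R^2 of a fitted lm object
adj_r2 <- function(model) {
  y <- model.response(model.frame(model))
  n <- length(y)
  k <- length(coef(model)) - 1   # number of regressors without the intercept
  ssr <- sum(residuals(model)^2)
  tss <- sum((y - mean(y))^2)
  1 - (n - 1) / (n - k - 1) * ssr / tss
}

# simulated data (not the CASchools data set)
set.seed(3)
x <- runif(200, 5, 55)
y <- 560 + 36 * log(x) + rnorm(200, sd = 12)
fit <- lm(y ~ log(x))

c(by_hand = adj_r2(fit), lm = summary(fit)$adj.r.squared)
```

Note that for models with log(score) as the dependent variable, TSS refers to
the variation of log(score), not of score.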
Let us now compare the cubic and the linear-log model by plotting the 
corresponding estimated regression functions. 


# generate a scatterplot 
plot(score ~ income, 
     data = CASchools, 
     col = "steelblue", 
     pch = 20, 
     main = "Linear-Log and Cubic Regression Functions") 


# add the linear-log regression line 
order_id <- order(CASchools$income) 


lines(CASchools$income[order_id], 
      fitted(LinearLog_model)[order_id], 
      col = "darkgreen", 
      lwd = 2) 


# add the cubic regression line 

lines(x = CASchools$income[order_id], 
y = fitted(cubic_model) [order_id], 
col = "darkred", 
lwd = 2) 


[Figure: Linear-Log and Cubic Regression Functions] 

Both regression lines look nearly identical. Altogether the linear-log model may 
be preferable since it is more parsimonious in terms of regressors: it does not 
include higher-degree polynomials. 

8.3 Interactions Between Independent Variables 


There are research questions where it is interesting to learn how the effect on 
Y of a change in an independent variable depends on the value of another 
independent variable. For example, we may ask if districts with many English 
learners benefit from a decrease in class sizes differently than districts with 
few English learners. To assess this using a multiple regression model, we 
include an interaction term. We consider three cases: 

1. Interactions between two binary variables. 

2. Interactions between a binary and a continuous variable. 

3. Interactions between two continuous variables. 


The following subsections discuss these cases briefly and demonstrate how to 
perform such regressions in R. 


Interactions Between Two Binary Variables 


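Take two binary variables $D_1$ and $D_2$ and the population regression model 


$$Y_i = \beta_0 + \beta_1 \times D_{1i} + \beta_2 \times D_{2i} + u_i.$$


Now assume that 


$$Y_i = \ln(Earnings_i),$$

$$D_{1i} = \begin{cases} 1 & \text{if the } i^{th} \text{ person has a college degree,} \\ 0 & \text{else,} \end{cases}$$

$$D_{2i} = \begin{cases} 1 & \text{if the } i^{th} \text{ person is female,} \\ 0 & \text{if the } i^{th} \text{ person is male.} \end{cases}$$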


We know that $\beta_1$ measures the average difference in $\ln(Earnings)$ between 
individuals with and without a college degree and $\beta_2$ is the gender differential 
in $\ln(Earnings)$, ceteris paribus. This model does not allow us to determine if 
there is a gender-specific effect of having a college degree and, if so, how strong 
this effect is. It is easy to come up with a model specification that allows us to 
investigate this: 


$$Y_i = \beta_0 + \beta_1 \times D_{1i} + \beta_2 \times D_{2i} + \beta_3 \times (D_{1i} \times D_{2i}) + u_i.$$

$(D_{1i} \times D_{2i})$ is called an interaction term and $\beta_3$ measures the difference in the 
effect of having a college degree for women versus men. 


Key Concept 8.3 
A Method for Interpreting Coefficients in Regression with Binary Variables 


Compute expected values of Y for each possible set described by the set 
of binary variables. Compare the expected values. The coefficients can 
be expressed either as expected values or as the difference between at 
least two expected values. 





Following Key Concept 8.3 we have 


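$$\begin{aligned} E(Y_i \mid D_{1i} = 0, D_{2i} = d_2) &= \beta_0 + \beta_1 \times 0 + \beta_2 \times d_2 + \beta_3 \times (0 \times d_2) \\ &= \beta_0 + \beta_2 \times d_2. \end{aligned}$$


If $D_{1i}$ switches from 0 to 1 we obtain 


$$\begin{aligned} E(Y_i \mid D_{1i} = 1, D_{2i} = d_2) &= \beta_0 + \beta_1 \times 1 + \beta_2 \times d_2 + \beta_3 \times (1 \times d_2) \\ &= \beta_0 + \beta_1 + \beta_2 \times d_2 + \beta_3 \times d_2. \end{aligned}$$


Hence, the overall effect is 


$$E(Y_i \mid D_{1i} = 1, D_{2i} = d_2) - E(Y_i \mid D_{1i} = 0, D_{2i} = d_2) = \beta_1 + \beta_3 \times d_2$$


so the effect is a difference of expected values. 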
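
The derivation above can be checked numerically. In the following sketch the
data are simulated (D1, D2, y and all coefficient values are assumptions chosen
for illustration). Because a regression on two binary variables and their
interaction fits one mean per cell, the effect of D1 computed from the
coefficients coincides with the difference of the corresponding cell means.

```r
# simulated data (hypothetical; loosely mimics the log earnings example)
set.seed(11)
n <- 1000
D1 <- rbinom(n, 1, 0.5)   # e.g. college degree
D2 <- rbinom(n, 1, 0.5)   # e.g. female
y <- 2.5 + 0.4 * D1 - 0.2 * D2 + 0.1 * D1 * D2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ D1 * D2)
b <- unname(coef(fit))    # order: intercept, D1, D2, D1:D2

# sample means of y in the four cells
cell_means <- tapply(y, list(D1 = D1, D2 = D2), mean)

# effect of D1 for D2 = 0 and D2 = 1, from the cell means ...
effect_means <- cell_means["1", ] - cell_means["0", ]
# ... and from the coefficients: beta_1 + beta_3 * d2
effect_coefs <- c(b[2], b[2] + b[4])

rbind(effect_means, effect_coefs)
```

Both rows agree, which illustrates that the coefficients can be expressed as
differences of expected values.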


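Application to the Student-Teacher Ratio and the Percentage of English Learners 


Now let 


$$HiSTR = \begin{cases} 1, & \text{if } STR \geq 20 \\ 0, & \text{else,} \end{cases}$$

$$HiEL = \begin{cases} 1, & \text{if } PctEL \geq 10 \\ 0, & \text{else.} \end{cases}$$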




We may use R to construct the variables above as follows. 


# append HiSTR to CASchools 
CASchools$HiSTR <- as.numeric(CASchools$size >= 20) 


# append HiEL to CASchools 
CASchools$HiEL <- as.numeric(CASchools$english >= 10) 


We proceed by estimating the model 


$$TestScore_i = \beta_0 + \beta_1 \times HiSTR + \beta_2 \times HiEL + \beta_3 \times (HiSTR \times HiEL) + u_i. \tag{8.1}$$


There are several ways to add the interaction term to the formula argument 
when using lm() but the most intuitive way is to use HiEL * HiSTR.¹ 


# estimate the model with a binary interaction term 
bi_model <- lm(score ~ HiSTR * HiEL, data = CASchools) 


# print a robust summary of the coefficients 
coeftest(bi_model, vcov. = vcovHC, type = "HC1") 


#> 
#> t test of coefficients: 
#> 
#>             Estimate Std. Error  t value  Pr(>|t|) 
#> (Intercept) 664.1433     1.3881 478.4589 < 2.2e-16 *** 
#> HiSTR        -1.9078     1.9322  -0.9874    0.3240 
#> HiEL        -18.3155     2.3340  -7.8472 3.634e-14 *** 
#> HiSTR:HiEL   -3.2601     3.1189  -1.0453    0.2965 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


The estimated regression model is 


$$\widehat{TestScore} = \underset{(1.39)}{664.1} - \underset{(1.93)}{1.9} \times HiSTR - \underset{(2.33)}{18.3} \times HiEL - \underset{(3.12)}{3.3} \times (HiSTR \times HiEL)$$


and it predicts that the effect of moving from a school district with a low 
student-teacher ratio to a district with a high student-teacher ratio is 
$-1.9 - 3.3 \times HiEL$, so it depends on whether the percentage of English 
learners is high or low. For districts with a low share of English learners 
($HiEL = 0$), the estimated effect is a decrease of 1.9 points in test scores, 
while for districts with a large fraction of English learners ($HiEL = 1$), the 
predicted decrease in test scores amounts to 1.9 + 3.3 = 5.2 points. 


¹ Appending HiEL * HiSTR to the formula will add HiEL, HiSTR and their interaction as 
regressors while HiEL:HiSTR only adds the interaction term. 
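
The difference between the formula operators * and : can be inspected with
model.matrix(), which returns the regressors that a formula generates. The
small data frame below is a toy example with made-up values, used only to show
the resulting column names.

```r
# toy data frame with all four combinations of the two binary variables
d <- data.frame(HiSTR = c(0, 0, 1, 1), HiEL = c(0, 1, 0, 1))

# `*` expands to both main effects plus their interaction
cols_star <- colnames(model.matrix(~ HiSTR * HiEL, data = d))
cols_star

# `:` adds the interaction term only
cols_colon <- colnames(model.matrix(~ HiSTR:HiEL, data = d))
cols_colon
```

The first call lists an intercept, HiSTR, HiEL and HiSTR:HiEL, the second only
an intercept and HiSTR:HiEL.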


We can also use the model to estimate the mean test score for each possible 
combination of the included binary variables. 


# estimate means for all combinations of HiSTR and HiEL 

# 1. 
predict(bi_model, newdata = data.frame("HiSTR" = 0, "HiEL" = 0)) 
#>        1 
#> 664.1433 


# 2. 
predict(bi_model, newdata = data.frame("HiSTR" = 0, "HiEL" = 1)) 
#>        1 
#> 645.8278 


# 3. 
predict(bi_model, newdata = data.frame("HiSTR" = 1, "HiEL" = 0)) 
#>        1 
#> 662.2354 


# 4. 
predict(bi_model, newdata = data.frame("HiSTR" = 1, "HiEL" = 1)) 
#>        1 
#> 640.6598 


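We now verify that these predictions can be computed from the coefficient 
estimates presented in equation (8.1): 


$$\begin{aligned} \widehat{TestScore} &= \hat\beta_0 = 664.1 && \Leftrightarrow \ HiSTR = 0, \ HiEL = 0 \\ \widehat{TestScore} &= \hat\beta_0 + \hat\beta_2 = 664.1 - 18.3 = 645.8 && \Leftrightarrow \ HiSTR = 0, \ HiEL = 1 \\ \widehat{TestScore} &= \hat\beta_0 + \hat\beta_1 = 664.1 - 1.9 = 662.2 && \Leftrightarrow \ HiSTR = 1, \ HiEL = 0 \\ \widehat{TestScore} &= \hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3 = 664.1 - 1.9 - 18.3 - 3.3 = 640.6 && \Leftrightarrow \ HiSTR = 1, \ HiEL = 1 \end{aligned}$$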

Interactions Between a Continuous and a Binary Variable 


Now let $X_i$ denote the years of working experience of person $i$, which is a 
continuous variable. We have 


$$Y_i = \ln(Earnings_i),$$

$$X_i = \text{working experience of person } i,$$

$$D_i = \begin{cases} 1, & \text{if the } i^{th} \text{ person has a college degree} \\ 0, & \text{else.} \end{cases}$$


The baseline model thus is 





$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i,$$


a multiple regression model that allows us to estimate the average benefit of 
having a college degree holding working experience constant as well as the aver- 
age effect on earnings of a change in working experience holding college degree 
constant. 


By adding the interaction term $X_i \times D_i$ we allow the effect of an additional 
year of work experience to differ between individuals with and without college 
degree, 


$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i) + u_i.$$


Here, $\beta_3$ is the expected difference in the effect of an additional year of work 
experience for college graduates versus non-graduates. Another possible 
specification is 


$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 (X_i \times D_i) + u_i.$$


This model states that the expected impact of an additional year of work ex- 
perience on earnings differs for college graduates and non-graduates but that 
graduating on its own does not increase earnings. 


All three regression functions can be visualized by straight lines. Key Concept 
8.4 summarizes the differences. 

Key Concept 8.4 
Interactions Between Binary and Continuous Variables 


An interaction term like $X_i \times D_i$ (where $X_i$ is continuous and $D_i$ is 
binary) allows for the slope to depend on the binary variable $D_i$. There 
are three possibilities: 

1. Different intercept and same slope: 

   $$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i$$

2. Different intercept and different slope: 

   $$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 \times (X_i \times D_i) + u_i$$

3. Same intercept and different slope: 

   $$Y_i = \beta_0 + \beta_1 X_i + \beta_2 (X_i \times D_i) + u_i$$





The following code chunk demonstrates how to replicate the results shown in 
Figure 8.8 of the book using artificial data. 


# generate artificial data 
set.seed(1) 

X <- runif(200, 0, 15) 
D <- sample(0:1, 200, replace = TRUE) 
Y <- 450 + 150 * X + 500 * D + 50 * (X * D) + rnorm(200, sd = 300) 


# divide plotting area accordingly 
m <- rbind(c(1, 2), c(3, 0)) 
graphics::layout(m) 


# estimate the models and plot the regression lines 

# 1. (baseline model) 
plot(X, log(Y), 
     pch = 20, 
     col = "steelblue", 
     main = "Different Intercepts, Same Slope") 

mod1_coef <- lm(log(Y) ~ X + D)$coefficients 

abline(coef = c(mod1_coef[1], mod1_coef[2]), 
       col = "red", 
       lwd = 1.5) 

abline(coef = c(mod1_coef[1] + mod1_coef[3], mod1_coef[2]), 
       col = "green", 
       lwd = 1.5) 


# 2. (baseline model + interaction term) 
plot(X, log(Y), 
     pch = 20, 
     col = "steelblue", 
     main = "Different Intercepts, Different Slopes") 

mod2_coef <- lm(log(Y) ~ X + D + X:D)$coefficients 

abline(coef = c(mod2_coef[1], mod2_coef[2]), 
       col = "red", 
       lwd = 1.5) 

abline(coef = c(mod2_coef[1] + mod2_coef[3], mod2_coef[2] + mod2_coef[4]), 
       col = "green", 
       lwd = 1.5) 


# 3. (omission of D as regressor + interaction term) 
plot(X, log(Y), 
     pch = 20, 
     col = "steelblue", 
     main = "Same Intercept, Different Slopes") 

mod3_coef <- lm(log(Y) ~ X + X:D)$coefficients 

abline(coef = c(mod3_coef[1], mod3_coef[2]), 
       col = "red", 
       lwd = 1.5) 

abline(coef = c(mod3_coef[1], mod3_coef[2] + mod3_coef[3]), 
       col = "green", 
       lwd = 1.5) 

Application to the Student-Teacher Ratio and the Percentage of English Learners 


Using a model specification like the second one discussed in Key Concept 8.4 
(different slope, different intercept) we may answer the question whether the 
effect on test scores of decreasing the student-teacher ratio depends on whether 
there are many or few English learners. We estimate the regression model 


$$TestScore_i = \beta_0 + \beta_1 \times size_i + \beta_2 \times HiEL_i + \beta_3 \times (size_i \times HiEL_i) + u_i.$$


# estimate the model 
bci_model <- lm(score ~ size + HiEL + size * HiEL, data = CASchools) 


# print robust summary of coefficients to the console 
coeftest(bci_model, vcov. = vcovHC, type = "HC1") 


#> 
#> t test of coefficients: 
#> 
#>              Estimate Std. Error t value Pr(>|t|) 
#> (Intercept) 682.24584   11.86781 57.4871   <2e-16 *** 
#> size         -0.96846    0.58910 -1.6440   0.1009 
#> HiEL          5.63914   19.51456  0.2890   0.7727 
#> size:HiEL    -1.27661    0.96692 -1.3203   0.1875 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


The estimated regression model is 


$$\widehat{TestScore} = \underset{(11.87)}{682.2} - \underset{(0.59)}{0.97} \times size + \underset{(19.51)}{5.6} \times HiEL - \underset{(0.97)}{1.28} \times (size \times HiEL).$$

The estimated regression line for districts with a low fraction of English learners 
($HiEL_i = 0$) is 

$$\widehat{TestScore} = 682.2 - 0.97 \times size_i.$$


For districts with a high fraction of English learners we have 


$$\begin{aligned} \widehat{TestScore} &= 682.2 + 5.6 - 0.97 \times size_i - 1.28 \times size_i \\ &= 687.8 - 2.25 \times size_i. \end{aligned}$$


The predicted increase in test scores following a reduction of the student-teacher 
ratio by 1 unit is about 0.97 points in districts where the fraction of English 
learners is low but 2.25 in districts with a high share of English learners. From 
the coefficient on the interaction term size x HiEL we see that the difference 
between both effects is 1.28 points. 


The next code chunk draws both lines belonging to the model. In order to make 
observations with HiEL = 0 distinguishable from those with HiEL = 1, we 
use different colors. 


# identify observations with PctEL >= 10 
id <- CASchools$english >= 10 


# plot observations with HiEL = 0 as red dots 
plot(CASchools$size[!id], CASchools$score[!id], 
     xlim = c(0, 27), 
     ylim = c(600, 720), 
     pch = 20, 
     col = "red", 
     main = "", 
     xlab = "Class Size", 
     ylab = "Test Score") 


# plot observations with HiEL = 1 as green dots 
points(CASchools$size[id], CASchools$score[id], 
       pch = 20, 
       col = "green") 


# read out estimated coefficients of bci_model 
coefs <- bci_model$coefficients 


# draw the estimated regression line for HiEL = 0 
abline(coef = c(coefs[1], coefs[2]), 
       col = "red", 
       lwd = 1.5) 

# draw the estimated regression line for HiEL = 1 
abline(coef = c(coefs[1] + coefs[3], coefs[2] + coefs[4]), 
       col = "green", 
       lwd = 1.5) 


# add a legend to the plot 
legend("topright", 
       pch = c(20, 20), 
       col = c("red", "green"), 
       legend = c("HiEL = 0", "HiEL = 1")) 

Interactions Between Two Continuous Variables 


Consider a regression model with $Y$ the log earnings and two continuous 
regressors $X_1$, the years of work experience, and $X_2$, the years of schooling. We 
want to estimate the effect on wages of an additional year of work experience 
depending on a given level of schooling. This effect can be assessed by including 
the interaction term $(X_{1i} \times X_{2i})$ in the model: 


$$Y_i = \beta_0 + \beta_1 \times X_{1i} + \beta_2 \times X_{2i} + \beta_3 \times (X_{1i} \times X_{2i}) + u_i.$$


Following Key Concept 8.1 we find that the effect on $Y$ of a change in $X_1$ given 
$X_2$ is 

$$\frac{\Delta Y}{\Delta X_1} = \beta_1 + \beta_3 X_2.$$


In the earnings example, a positive $\beta_3$ implies that the effect on log earnings 
of an additional year of work experience grows linearly with years of schooling. 
Vice versa we have 

$$\frac{\Delta Y}{\Delta X_2} = \beta_2 + \beta_3 X_1$$

as the effect on log earnings of an additional year of schooling holding work 
experience constant. 


Altogether we find that $\beta_3$ measures the effect of a unit increase in $X_1$ and $X_2$ 
beyond the effects of increasing $X_1$ alone and $X_2$ alone by one unit. The overall 
change in $Y$ is thus 


$$\Delta Y = (\beta_1 + \beta_3 X_2) \Delta X_1 + (\beta_2 + \beta_3 X_1) \Delta X_2 + \beta_3 \Delta X_1 \Delta X_2. \tag{8.2}$$
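
For completeness, the overall change in $Y$ stated in (8.2) follows from
evaluating the regression function with the interaction term at
$(X_1 + \Delta X_1, X_2 + \Delta X_2)$ and at $(X_1, X_2)$ and taking the
difference (the error term is held fixed):

$$\begin{aligned} \Delta Y &= \left[\beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 (X_2 + \Delta X_2) + \beta_3 (X_1 + \Delta X_1)(X_2 + \Delta X_2)\right] - \left[\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2\right] \\ &= \beta_1 \Delta X_1 + \beta_2 \Delta X_2 + \beta_3 \left(X_2 \Delta X_1 + X_1 \Delta X_2 + \Delta X_1 \Delta X_2\right) \\ &= (\beta_1 + \beta_3 X_2) \Delta X_1 + (\beta_2 + \beta_3 X_1) \Delta X_2 + \beta_3 \Delta X_1 \Delta X_2. \end{aligned}$$

Setting $\Delta X_2 = 0$ and dividing by $\Delta X_1$ gives the effect
$\beta_1 + \beta_3 X_2$ of a change in $X_1$ alone, and analogously for $X_2$.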


Key Concept 8.5 summarizes interactions between two regressors in multiple 
regression. 


Key Concept 8.5 
Interactions in Multiple Regression 


The interaction term between the two regressors $X_1$ and $X_2$ is given by
their product $X_1 \times X_2$. Adding this interaction term as a regressor to
the model

$$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u_i$$

allows the effect on $Y$ of a change in $X_2$ to depend on the value of $X_1$
and vice versa. Thus the coefficient $\beta_3$ in the model

$$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + u_i$$

measures the effect of a one-unit increase in both $X_1$ and $X_2$ above and
beyond the sum of both individual effects. This holds for continuous and
binary regressors.
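
The idea of Key Concept 8.5 can be tried out on simulated data. The sketch
below is an illustration only: the data generating process, the sample size and
the object names (int_mod, effect_X1) are assumptions made for this example.

```r
# simulated data; all parameter values are illustrative assumptions
set.seed(1)
n  <- 1000
X1 <- runif(n, 0, 10)
X2 <- runif(n, 0, 10)
Y  <- 2 + 0.5 * X1 + 1 * X2 + 0.3 * X1 * X2 + rnorm(n)

# in a formula, 'X1 * X2' expands to X1 + X2 + X1:X2
int_mod <- lm(Y ~ X1 * X2)
coef(int_mod)

# estimated effect on Y of a one-unit change in X1 when X2 = 4
# (the true value in this simulation is 0.5 + 0.3 * 4 = 1.7)
effect_X1 <- unname(coef(int_mod)["X1"] + coef(int_mod)["X1:X2"] * 4)
effect_X1
```

Because the effect of $X_1$ depends on $X_2$, the coefficient on X1 alone is the
effect only for $X_2 = 0$; for any other value the interaction coefficient has
to be taken into account.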





8.3.0.1 Application to the Student-Teacher Ratio and the Percentage of English Learners


We now examine the interaction between the continuous variables student- 
teacher ratio and the percentage of English learners. 


# estimate regression model including the interaction between 'PctEL' and 'size'
cci_model <- lm(score ~ size + english + english * size, data = CASchools)

# print a robust coefficient summary to the console
coeftest(cci_model, vcov. = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>                 Estimate  Std. Error t value Pr(>|t|)    
#> (Intercept)  686.3385268  11.7593466 58.3654  < 2e-16 ***
#> size          -1.1170184   0.5875136 -1.9013  0.05796 .  
#> english       -0.6729119   0.3741231 -1.7986  0.07280 .  
#> size:english   0.0011618   0.0185357  0.0627  0.95005    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated model equation is 


$$\widehat{TestScore} = \underset{(11.76)}{686.3} - \underset{(0.59)}{1.12} \times STR - \underset{(0.37)}{0.67} \times PctEL + \underset{(0.02)}{0.0012} \times (STR \times PctEL).$$


For the interpretation, let us consider the quartiles of PctEL. 


summary(CASchools$english)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.000   1.941   8.778  15.768  22.970  85.540


According to (8.2), if PctEL is at its median value of 8.78, the slope of the
regression function relating test scores and the student-teacher ratio is
predicted to be $-1.12 + 0.0012 \times 8.78 = -1.11$. This means that increasing
the student-teacher ratio by one unit is expected to lower test scores by 1.11
points. At the 75% quantile of PctEL, the estimated change in TestScore from a
one-unit increase in STR is $-1.12 + 0.0012 \times 23.0 = -1.09$, so the slope
is somewhat smaller in absolute value. The interpretation is that for a school
district with a share of 23% English learners, a reduction of the
student-teacher ratio by one unit is expected to increase test scores by only
1.09 points.


However, the output of coeftest() indicates that the difference between the
effects at the median and at the 75% quantile is not statistically significant:
$H_0: \beta_3 = 0$ cannot be rejected at the 5% level of significance (the
p-value is 0.95).
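
The slopes discussed above can be recomputed from the reported estimates. This
is a minimal sketch that copies the numbers from the console output instead of
re-estimating the model; the object names are hypothetical.

```r
# coefficient estimates copied from the coeftest() output
b_size        <- -1.1170184
b_interaction <-  0.0011618

# median and 75% quantile of PctEL as reported by summary()
pct_el <- c(median = 8.778, q75 = 22.970)

# slope of the regression function with respect to the student-teacher ratio
slopes <- b_size + b_interaction * pct_el
round(slopes, 2)
```

With the model object at hand, the same quantities could be obtained from
coef(cci_model) instead of typing the estimates by hand.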


Example: The Demand for Economic Journals 


In this section we replicate the empirical example The Demand for Economic
Journals presented on pages 336-337 of the book. The central question is:
how elastic is the demand by libraries for economic journals? The idea is to
analyze the relationship between the number of subscriptions to a journal at
U.S. libraries and the journal's subscription price. The study uses the data
set Journals, which is provided with the AER package and contains observations
for 180 economic journals for the year 2000. You can use the help function
(?Journals) to get more information on the data after loading the package.
# load package and the data set
library(AER)
data("Journals")


We measure the price as “price per citation” and compute journal age and the 
number of characters manually. For consistency with the book we also rename 
the variables. 


# define and rename variables
Journals$PricePerCitation <- Journals$price / Journals$citations
Journals$Age <- 2000 - Journals$foundingyear
Journals$Characters <- Journals$charpp * Journals$pages / 10^6
Journals$Subscriptions <- Journals$subs


The range of “price per citation” is quite large: 


# compute summary statistics for price per citation
summary(Journals$PricePerCitation)
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#>  0.005223  0.464495  1.320513  2.548455  3.440171 24.459459


The lowest price observed is a mere 0.5¢ per citation while the highest price is 
more than 20¢ per citation. 


We now estimate four different model specifications. All models are log-log
models. This is useful because it allows us to directly interpret the
coefficients as elasticities, see Key Concept 8.2. (I) is a linear model in the
logarithms. To alleviate a possible omitted variable bias, (II) augments (I) by
the covariates ln(Age) and ln(Characters). The largest model (III) attempts to
capture nonlinearities in the relationship of ln(Subscriptions) and
ln(PricePerCitation) using a cubic regression function of ln(PricePerCitation)
and also adds the interaction term ln(Age) x ln(PricePerCitation), while
specification (IV) keeps the interaction term but does not include the
quadratic and cubic terms.


(I) $\ln(Subscriptions_i) = \beta_0 + \beta_1 \ln(PricePerCitation_i) + u_i$


(II) $\ln(Subscriptions_i) = \beta_0 + \beta_1 \ln(PricePerCitation_i) + \beta_4 \ln(Age_i) + \beta_6 \ln(Characters_i) + u_i$


(III) $\ln(Subscriptions_i) = \beta_0 + \beta_1 \ln(PricePerCitation_i) + \beta_2 \ln(PricePerCitation_i)^2 + \beta_3 \ln(PricePerCitation_i)^3 + \beta_4 \ln(Age_i) + \beta_5 [\ln(Age_i) \times \ln(PricePerCitation_i)] + \beta_6 \ln(Characters_i) + u_i$


(IV) $\ln(Subscriptions_i) = \beta_0 + \beta_1 \ln(PricePerCitation_i) + \beta_4 \ln(Age_i) + \beta_5 [\ln(Age_i) \times \ln(PricePerCitation_i)] + \beta_6 \ln(Characters_i) + u_i$

# Estimate models (I) - (IV)
Journals_mod1 <- lm(log(Subscriptions) ~ log(PricePerCitation),
                    data = Journals)

Journals_mod2 <- lm(log(Subscriptions) ~ log(PricePerCitation)
                    + log(Age) + log(Characters),
                    data = Journals)

Journals_mod3 <- lm(log(Subscriptions) ~
                    log(PricePerCitation) + I(log(PricePerCitation)^2)
                    + I(log(PricePerCitation)^3) + log(Age)
                    + log(Age):log(PricePerCitation) + log(Characters),
                    data = Journals)

Journals_mod4 <- lm(log(Subscriptions) ~
                    log(PricePerCitation) + log(Age)
                    + log(Age):log(PricePerCitation) +
                    log(Characters),
                    data = Journals)


Using summary(), we obtain the following estimated models:


(I) $\widehat{\ln(Subscriptions_i)} = 4.77 - 0.53 \ln(PricePerCitation_i)$


(II) $\widehat{\ln(Subscriptions_i)} = 3.21 - 0.41 \ln(PricePerCitation_i) + 0.42 \ln(Age_i) + 0.21 \ln(Characters_i)$


(III) $\widehat{\ln(Subscriptions_i)} = 3.41 - 0.96 \ln(PricePerCitation_i) + 0.02 \ln(PricePerCitation_i)^2 + 0.004 \ln(PricePerCitation_i)^3 + 0.37 \ln(Age_i) + 0.16 [\ln(Age_i) \times \ln(PricePerCitation_i)] + 0.23 \ln(Characters_i)$


(IV) $\widehat{\ln(Subscriptions_i)} = 3.43 - 0.90 \ln(PricePerCitation_i) + 0.37 \ln(Age_i) + 0.14 [\ln(Age_i) \times \ln(PricePerCitation_i)] + 0.23 \ln(Characters_i)$


We use an F-test to test whether the quadratic and cubic transformations of
ln(PricePerCitation) in model (III) are statistically significant.


# F-Test for significance of cubic terms
linearHypothesis(Journals_mod3,
                 c("I(log(PricePerCitation)^2)=0", "I(log(PricePerCitation)^3)=0"),
                 vcov. = vcovHC, type = "HC1")
#> Linear hypothesis test
#> 
#> Hypothesis:
#> I(log(PricePerCitation)^2) = 0
#> I(log(PricePerCitation)^3) = 0
#> 
#> Model 1: restricted model
#> Model 2: log(Subscriptions) ~ log(PricePerCitation) + I(log(PricePerCitation)^2) + 
#>     I(log(PricePerCitation)^3) + log(Age) + log(Age):log(PricePerCitation) + 
#>     log(Characters)
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F Pr(>F)
#> 1    175                 
#> 2    173  2 0.1943 0.8236


Clearly, we cannot reject the null hypothesis $H_0: \beta_2 = \beta_3 = 0$ in model (III).


We now demonstrate how the function stargazer() can be used to generate a 
tabular representation of all four models. 


# load the stargazer package
library(stargazer)

# gather robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(Journals_mod1, type = "HC1"))),
               sqrt(diag(vcovHC(Journals_mod2, type = "HC1"))),
               sqrt(diag(vcovHC(Journals_mod3, type = "HC1"))),
               sqrt(diag(vcovHC(Journals_mod4, type = "HC1"))))

# generate a LaTeX table using stargazer
stargazer(Journals_mod1, Journals_mod2, Journals_mod3, Journals_mod4,
          se = rob_se,
          digits = 3,
          column.labels = c("(I)", "(II)", "(III)", "(IV)"))


The subsequent code chunk reproduces Figure 8.9 of the book.

# divide plotting area
m <- rbind(c(1, 2), c(3, 0))
graphics::layout(m)


# scatterplot 
Table 8.1: Nonlinear Regression Models of Journal Subscriptions

Dependent Variable: Logarithm of Subscriptions

                                   log(Subscriptions)
                                   (I)          (II)         (III)        (IV)
log(PricePerCitation)             -0.533***    -0.408***    -0.961***    -0.899***
                                  (0.034)      (0.044)      (0.160)      (0.145)
I(log(PricePerCitation)^2)                                   0.017
                                                            (0.025)
I(log(PricePerCitation)^3)                                   0.004
                                                            (0.006)
log(Age)                                        0.424***     0.373***     0.374***
                                               (0.119)      (0.118)      (0.118)
log(Characters)                                 0.206**      0.235**      0.229**
                                               (0.098)      (0.098)      (0.096)
log(PricePerCitation):log(Age)                               0.156***     0.141***
                                                            (0.052)      (0.040)
Constant                           4.766***     3.207***     3.408***     3.434***
                                  (0.055)      (0.380)      (0.374)      (0.367)
Observations                       180          180          180          180
R^2                                0.557        0.613        0.635        0.634
Adjusted R^2                       0.555        0.607        0.622        0.626
Residual Std. Error                0.750        0.705        0.691        0.688
  (df)                            (178)        (176)        (173)        (175)
F Statistic                      224.037***    93.009***    50.149***    75.749***
  (df)                            (1; 178)     (3; 176)     (6; 173)     (4; 175)

Note: *p<0.1; **p<0.05; ***p<0.01

plot(Journals$PricePerCitation,
     Journals$Subscriptions,
     pch = 20,
     col = "steelblue",
     ylab = "Subscriptions",
     xlab = "Price per citation",
     main = "(a)")

# log-log scatterplot and estimated regression line (I)
plot(log(Journals$PricePerCitation),
     log(Journals$Subscriptions),
     pch = 20,
     col = "steelblue",
     ylab = "ln(Subscriptions)",
     xlab = "ln(Price per citation)",
     main = "(b)")

abline(Journals_mod1,
       lwd = 1.5)

# log-log scatterplot and regression lines (IV) for Age = 5 and Age = 80
plot(log(Journals$PricePerCitation),
     log(Journals$Subscriptions),
     pch = 20,
     col = "steelblue",
     ylab = "ln(Subscriptions)",
     xlab = "ln(Price per citation)",
     main = "(c)")

# coefficients of model (IV): intercept, log price, log age,
# log characters and the price-age interaction (in this order)
JM4C <- Journals_mod4$coefficients

# Age = 80
abline(coef = c(JM4C[1] + JM4C[3] * log(80),
                JM4C[2] + JM4C[5] * log(80)),
       col = "darkred",
       lwd = 1.5)

# Age = 5
abline(coef = c(JM4C[1] + JM4C[3] * log(5),
                JM4C[2] + JM4C[5] * log(5)),
       col = "darkgreen",
       lwd = 1.5)

As can be seen from plots (a) and (b), the relation between subscriptions and
the citation price is negative and nonlinear. Log-transforming both variables
makes it approximately linear. Plot (c) shows that the price elasticity of journal
subscriptions depends on the journal's age: the red line shows the estimated
relationship for Age = 80 while the green line represents the prediction from
model (IV) for Age = 5.


Which conclusions can be drawn?


1. We conclude that the demand for journals is more elastic for young
journals than for old journals.


2. For model (III) we cannot reject the null hypothesis that the coefficients
on $\ln(PricePerCitation)^2$ and $\ln(PricePerCitation)^3$ are both zero using
an F-test. This is evidence compatible with a linear relation between
log-subscriptions and log-price.


3. Demand is greater for journals with more characters, holding price and
age constant.


Altogether our estimates suggest that the demand is inelastic, i.e., the
libraries' demand for economic journals is quite insensitive to the price: using
model (IV), even for a young journal (Age = 5) we estimate the price elasticity
to be $-0.899 + 0.141 \times \ln(5) \approx -0.67$, so a one percent increase in
price is predicted to reduce the demand by only 0.67 percent. For an old journal
(Age = 80) the estimated elasticity is $-0.899 + 0.141 \times \ln(80) \approx -0.28$.


This finding comes as no surprise since providing the most recent publications
is a necessity for libraries.
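
In model (IV) the price elasticity of demand is the derivative of
ln(Subscriptions) with respect to ln(PricePerCitation), that is, the coefficient
on ln(PricePerCitation) plus the interaction coefficient times ln(Age). The
following sketch evaluates this expression for a few journal ages using the
rounded estimates from Table 8.1; the helper name elasticity_mod4 and the chosen
ages are assumptions made for illustration.

```r
# rounded coefficient estimates of model (IV) taken from Table 8.1
b_price       <- -0.899  # coefficient on log(PricePerCitation)
b_interaction <-  0.141  # coefficient on log(PricePerCitation):log(Age)

# price elasticity of demand implied by model (IV) for a journal of a given age
elasticity_mod4 <- function(age) b_price + b_interaction * log(age)

# evaluate the elasticity for a young, a medium-aged and an old journal
ages <- c(5, 20, 80)
data.frame(Age = ages, Elasticity = round(elasticity_mod4(ages), 2))
```

The elasticity moves towards zero as Age increases, which is the pattern shown
by the two regression lines in plot (c).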
8.4 Nonlinear Effects on Test Scores of the Student-Teacher Ratio


In this section we will discuss three specific questions about the relationship 
between test scores and the student-teacher ratio: 


1. Does the effect on test scores of decreasing the student-teacher ratio de- 
pend on the fraction of English learners when we control for economic 
idiosyncrasies of the different districts? 


2. Does this effect depend on the student-teacher ratio?


3. How strong is the effect of decreasing the student-teacher ratio (by two 
students per teacher) if we take into account economic characteristics and 
nonlinearities? 


To answer these questions we consider a total of seven models, some of which
are nonlinear regression specifications of the types that have been discussed 
before. As measures for the students’ economic backgrounds, we additionally 
consider the regressors lunch and In(income). We use the logarithm of income 
because the analysis in Chapter 8.2 showed that the nonlinear relationship
between income and TestScores is approximately logarithmic. We do not include
expenditure per pupil (expenditure) because doing so would imply that
expenditure is held constant when the student-teacher ratio changes (see
Chapter 7.2 of the book for a detailed argument).


Nonlinear Regression Models of Test Scores 


The considered model specifications are: 
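\begin{align}
TestScore_i =& \beta_0 + \beta_1 size_i + \beta_4 english_i + \beta_9 lunch_i + u_i \tag{8.3} \\
TestScore_i =& \beta_0 + \beta_1 size_i + \beta_4 english_i + \beta_9 lunch_i + \beta_{10} \ln(income_i) + u_i \tag{8.4} \\
TestScore_i =& \beta_0 + \beta_1 size_i + \beta_5 HiEL_i + \beta_6 (HiEL_i \times size_i) + u_i \tag{8.5} \\
TestScore_i =& \beta_0 + \beta_1 size_i + \beta_5 HiEL_i + \beta_6 (HiEL_i \times size_i) + \beta_9 lunch_i + \beta_{10} \ln(income_i) + u_i \tag{8.6} \\
TestScore_i =& \beta_0 + \beta_1 size_i + \beta_2 size_i^2 + \beta_3 size_i^3 + \beta_5 HiEL_i + \beta_9 lunch_i + \beta_{10} \ln(income_i) + u_i \tag{8.7} \\
TestScore_i =& \beta_0 + \beta_1 size_i + \beta_2 size_i^2 + \beta_3 size_i^3 + \beta_5 HiEL_i + \beta_6 (HiEL_i \times size_i) \tag{8.8} \\
             & + \beta_7 (HiEL_i \times size_i^2) + \beta_8 (HiEL_i \times size_i^3) + \beta_9 lunch_i + \beta_{10} \ln(income_i) + u_i \tag{8.9} \\
TestScore_i =& \beta_0 + \beta_1 size_i + \beta_2 size_i^2 + \beta_3 size_i^3 + \beta_4 english_i + \beta_9 lunch_i + \beta_{10} \ln(income_i) + u_i \tag{8.10}
\end{align}
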


# estimate all models
TestScore_mod1 <- lm(score ~ size + english + lunch, data = CASchools)

TestScore_mod2 <- lm(score ~ size + english + lunch + log(income), data = CASchools)

TestScore_mod3 <- lm(score ~ size + HiEL + HiEL:size, data = CASchools)

TestScore_mod4 <- lm(score ~ size + HiEL + HiEL:size + lunch + log(income),
                     data = CASchools)

TestScore_mod5 <- lm(score ~ size + I(size^2) + I(size^3) + HiEL + lunch + log(income),
                     data = CASchools)

TestScore_mod6 <- lm(score ~ size + I(size^2) + I(size^3) + HiEL + HiEL:size +
                     HiEL:I(size^2) + HiEL:I(size^3) + lunch + log(income),
                     data = CASchools)

TestScore_mod7 <- lm(score ~ size + I(size^2) + I(size^3) + english + lunch + log(income),
                     data = CASchools)


We may use summary() to assess the models' fit. Using stargazer() we may
also obtain a tabular representation of all regression outputs, which is more
convenient for comparing the models.


# gather robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(TestScore_mod1, type = "HC1"))),
               sqrt(diag(vcovHC(TestScore_mod2, type = "HC1"))),
               sqrt(diag(vcovHC(TestScore_mod3, type = "HC1"))),
               sqrt(diag(vcovHC(TestScore_mod4, type = "HC1"))),
               sqrt(diag(vcovHC(TestScore_mod5, type = "HC1"))),
               sqrt(diag(vcovHC(TestScore_mod6, type = "HC1"))),
               sqrt(diag(vcovHC(TestScore_mod7, type = "HC1"))))

# generate a LaTeX table of regression outputs
stargazer(TestScore_mod1,
          TestScore_mod2,
          TestScore_mod3,
          TestScore_mod4,
          TestScore_mod5,
          TestScore_mod6,
          TestScore_mod7,
          digits = 3,
          dep.var.caption = "Dependent Variable: Test Score",
          se = rob_se,
          column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)", "(7)"))


Let us summarize what can be concluded from the results presented in Table 
8.2. 


First of all, the coefficient on size is statistically significant in all models
except (3) and (4). Adding ln(income) to model (1) we find that the
corresponding coefficient is statistically significant at 1% while all other
coefficients remain at their significance level. Furthermore, the estimate for
the coefficient on size is roughly 0.27 points larger, which may be a sign of
attenuated omitted variable bias. We consider this a reason to include
ln(income) as a regressor in other models, too.


Regressions (3) and (4) aim to assess the effect of allowing for an interaction
between size and HiEL, without and with economic control variables. In both
models, neither the coefficient on the interaction term nor the coefficient on
the dummy is statistically significant. Thus, even with economic controls we
cannot reject the null hypothesis that the effect of the student-teacher ratio
on test scores is the same for districts with a high and districts with a low
share of English learning students.


Regression (5) includes a cubic term for the student-teacher ratio and omits
the interaction between size and HiEL. The results indicate that there is a
nonlinear effect of the student-teacher ratio on test scores. (Can you verify
this using an F-test of $H_0: \beta_2 = \beta_3 = 0$?)
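
One way to carry out the suggested test is sketched below. The sketch assumes
that the AER package is installed and re-creates the variables size, score and
HiEL in the way they are commonly defined for this data set (these definitions
are assumptions here since they are not repeated in this section); the object
names mod5 and f_test are hypothetical.

```r
# AER attaches car (linearHypothesis) and sandwich (vcovHC)
library(AER)
data(CASchools)

# assumed variable definitions: student-teacher ratio, average test score
# and a dummy for districts with at least 10% English learners
CASchools$size  <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2
CASchools$HiEL  <- as.numeric(CASchools$english >= 10)

# specification (5)
mod5 <- lm(score ~ size + I(size^2) + I(size^3) + HiEL + lunch + log(income),
           data = CASchools)

# robust F-test of the null that the coefficients on size^2 and size^3 are zero
f_test <- linearHypothesis(mod5,
                           c("I(size^2) = 0", "I(size^3) = 0"),
                           vcov. = vcovHC(mod5, type = "HC1"))
f_test
```

A small p-value in the second row of the output speaks against a linear
relationship between test scores and the student-teacher ratio.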


Consequently, regression (6) further explores whether the fraction of English
learners impacts the effect of the student-teacher ratio by using
$HiEL \times size$ and the interactions $HiEL \times size^2$ and
$HiEL \times size^3$. All individual t-tests indicate that there are
significant effects. We check this using a robust F-test of
$H_0: \beta_6 = \beta_7 = \beta_8 = 0$.
Table 8.2: Nonlinear Models of Test Scores

Dependent Variable: Test Score

                     score
                     (1)         (2)         (3)         (4)         (5)         (6)         (7)
size                -0.998***   -0.734***   -0.968      -0.531      64.339***   83.702***   65.285***
                    (0.270)     (0.257)     (0.589)     (0.342)    (24.861)    (28.497)    (25.259)
english             -0.122***   -0.176***                                                   -0.166***
                    (0.033)     (0.034)                                                     (0.034)
I(size^2)                                                           -3.424***   -4.381***   -3.466***
                                                                    (1.250)     (1.441)     (1.271)
I(size^3)                                                            0.059***    0.075***    0.060***
                                                                    (0.021)     (0.024)     (0.021)
lunch               -0.547***   -0.398***               -0.411***   -0.420***   -0.418***   -0.402***
                    (0.024)     (0.033)                 (0.029)     (0.029)     (0.029)     (0.033)
log(income)                     11.569***               12.124***   11.748***   11.800***   11.509***
                                (1.819)                 (1.798)     (1.771)     (1.778)     (1.806)
HiEL                                         5.639       5.498      -5.474***  816.076**
                                           (19.515)     (9.795)     (1.034)   (327.674)
size:HiEL                                   -1.277      -0.578                -123.282**
                                            (0.967)     (0.496)                (50.213)
I(size^2):HiEL                                                                   6.121**
                                                                                (2.542)
I(size^3):HiEL                                                                  -0.101**
                                                                                (0.043)
Constant           700.150***  658.552***  682.246***  653.666***  252.050     122.353     244.809
                    (5.568)     (8.642)    (11.868)     (9.869)   (163.634)   (185.519)   (165.722)
Observations        420         420         420         420         420         420         420
R^2                 0.775       0.796       0.310       0.797       0.801       0.803       0.801
Adjusted R^2        0.773       0.794       0.305       0.795       0.798       0.799       0.798

Note: *p<0.1; **p<0.05; ***p<0.01

# check joint significance of the interaction terms
linearHypothesis(TestScore_mod6,
                 c("size:HiEL=0", "I(size^2):HiEL=0", "I(size^3):HiEL=0"),
                 vcov. = vcovHC, type = "HC1")
#> Linear hypothesis test
#> 
#> Hypothesis:
#> size:HiEL = 0
#> I(size^2):HiEL = 0
#> I(size^3):HiEL = 0
#> 
#> Model 1: restricted model
#> Model 2: score ~ size + I(size^2) + I(size^3) + HiEL + HiEL:size + HiEL:I(size^2) + 
#>     HiEL:I(size^3) + lunch + log(income)
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F  Pr(>F)  
#> 1    413                    
#> 2    410  3 2.1885 0.08882 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We find that the null can be rejected at the 10% level but not at the 5% level
(the p-value is 0.089). This provides only weak evidence that the regression
function differs for districts with high and low percentage of English
learners.
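
The p-value in the output can be recovered from the reported F statistic and
its degrees of freedom, which makes the comparison with the usual significance
levels explicit. The following lines are a small sketch that copies the numbers
from the console output; the object names are hypothetical.

```r
# F statistic and degrees of freedom copied from the linearHypothesis() output
F_stat <- 2.1885
df1    <- 3    # number of restrictions
df2    <- 410  # residual degrees of freedom of the unrestricted model

# p-value: probability of a larger F statistic if the null is true
p_value <- pf(F_stat, df1, df2, lower.tail = FALSE)
p_value

# compare the p-value with conventional significance levels
alpha <- c(0.10, 0.05, 0.01)
data.frame(level = alpha, reject = p_value < alpha)
```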


Specification (7) uses a continuous measure for the share of English learners 
instead of a dummy variable (and thus does not include interaction terms). We 
observe only small changes to the coefficient estimates on the other regressors 
and thus conclude that the results observed for specification (5) are not sensitive 
to the way the percentage of English learners is measured. 


We continue by reproducing Figure 8.10 of the book for interpretation of
specifications (2), (5) and (7).


# scatterplot
plot(CASchools$size,
     CASchools$score,
     xlim = c(12, 28),
     ylim = c(600, 740),
     pch = 20,
     col = "gray",
     xlab = "Student-Teacher Ratio",
     ylab = "Test Score")

# add a legend
legend("top",
       legend = c("Linear Regression (2)",
                  "Cubic Regression (5)",
                  "Cubic Regression (7)"),
       cex = 0.8,
       ncol = 3,
       lty = c(1, 1, 2),
       col = c("blue", "red", "black"))

# data for use with predict()
new_data <- data.frame("size" = seq(16, 24, 0.05),
                       "english" = mean(CASchools$english),
                       "lunch" = mean(CASchools$lunch),
                       "income" = mean(CASchools$income),
                       "HiEL" = mean(CASchools$HiEL))

# add estimated regression function for model (2)
fitted <- predict(TestScore_mod2, newdata = new_data)

lines(new_data$size,
      fitted,
      lwd = 1.5,
      col = "blue")

# add estimated regression function for model (5)
fitted <- predict(TestScore_mod5, newdata = new_data)

lines(new_data$size,
      fitted,
      lwd = 1.5,
      col = "red")

# add estimated regression function for model (7)
fitted <- predict(TestScore_mod7, newdata = new_data)

lines(new_data$size,
      fitted,
      col = "black",
      lwd = 1.5,
      lty = 2)

For the above figure all regressors except size are set to their sample averages.
We see that the cubic regressions (5) and (7) are almost identical. They indicate
that the relation between test scores and the student-teacher ratio only has a
small amount of nonlinearity since they do not deviate much from the regression
function of (2).


The next code chunk reproduces Figure 8.11 of the book. We use plot() and
points() to color observations depending on HiEL. Again, the regression lines
are drawn based on predictions using sample averages of all regressors
except for size.


# draw scatterplot

# observations with HiEL = 0
plot(CASchools$size[CASchools$HiEL == 0],
     CASchools$score[CASchools$HiEL == 0],
     xlim = c(12, 28),
     ylim = c(600, 730),
     pch = 20,
     col = "gray",
     xlab = "Student-Teacher Ratio",
     ylab = "Test Score")

# observations with HiEL = 1
points(CASchools$size[CASchools$HiEL == 1],
       CASchools$score[CASchools$HiEL == 1],
       col = "steelblue",
       pch = 20)

# add a legend
legend("top",
       legend = c("Regression (6) with HiEL=0", "Regression (6) with HiEL=1"),
       cex = 0.7,
       ncol = 2,
       lty = c(1, 1),
       col = c("green", "red"))

# data for use with 'predict()'
new_data <- data.frame("size" = seq(12, 28, 0.05),
                       "english" = mean(CASchools$english),
                       "lunch" = mean(CASchools$lunch),
                       "income" = mean(CASchools$income),
                       "HiEL" = 0)

# add estimated regression function for model (6) with HiEL=0
fitted <- predict(TestScore_mod6, newdata = new_data)

lines(new_data$size,
      fitted,
      lwd = 1.5,
      col = "green")

# add estimated regression function for model (6) with HiEL=1
new_data$HiEL <- 1

fitted <- predict(TestScore_mod6, newdata = new_data)

lines(new_data$size,
      fitted,
      lwd = 1.5,
      col = "red")


The regression output shows that model (6) finds statistically significant
coefficients on the interaction terms $HiEL \times size$, $HiEL \times size^2$
and $HiEL \times size^3$, i.e., there is evidence that the nonlinear
relationship connecting test scores and student-teacher ratio depends on the
fraction of English learning students in the district. However, the above
figure shows that this difference is not of practical importance and is a good
example of why one should be careful when interpreting nonlinear models:
although the two regression functions look different, we see that the slopes of
both functions are almost identical for student-teacher ratios between 17 and
23. Since this range includes almost 90% of all observations, we can be
confident that nonlinear interactions between the fraction of English learners
and the student-teacher ratio can be neglected.


One might be tempted to object since both functions show opposing slopes for
student-teacher ratios below 15 and beyond 24. There are at least two possible
counterarguments:


1. There are only few observations with low and high values of the
student-teacher ratio, so there is only little information to be exploited
when estimating the model. This means the estimated function is less precise
in the tails of the data set.


2. The behavior of the regression function described above is a typical caveat
when using cubic functions since they generally show extreme behavior for
extreme regressor values. Think of the graph of $f(x) = x^3$.


We thus find no clear evidence that the relation between class size and test
scores depends on the percentage of English learners in the district.


Summary 


We are now able to answer the three questions posed at the beginning of this
section.


1. In the linear models, the percentage of English learners has only little
influence on the effect on test scores from changing the student-teacher
ratio. This result stays valid if we control for the economic background of
the students. While the cubic specification (6) provides evidence that
the effect of the student-teacher ratio on test scores depends on the share of
English learners, the strength of this effect is negligible.


2. When controlling for the students' economic background we find evidence
of nonlinearities in the relationship between student-teacher ratio and test
scores.


3. The linear specification (2) predicts that a reduction of the student-teacher
ratio by two students per teacher leads to an improvement in test scores
of about $-0.73 \times (-2) = 1.46$ points. Since the model is linear, this
effect is independent of the class size. In the nonlinear models the effect
depends on the initial class size. For example, if the student-teacher ratio
is 20, the nonlinear model (5) predicts that a reduction to 18 increases test
scores by

$$64.33 \cdot 18 + 18^2 \cdot (-3.42) + 18^3 \cdot (0.059) - (64.33 \cdot 20 + 20^2 \cdot (-3.42) + 20^3 \cdot (0.059)) \approx 3.3$$

points. If the ratio is 22, a reduction to 20 leads to a predicted improvement
in test scores of

$$64.33 \cdot 20 + 20^2 \cdot (-3.42) + 20^3 \cdot (0.059) - (64.33 \cdot 22 + 22^2 \cdot (-3.42) + 22^3 \cdot (0.059)) \approx 2.4$$

points. This suggests that the effect is stronger in smaller classes.
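
The two predicted effects from the cubic specification can be reproduced with a
few lines of R. The sketch uses the rounded coefficients quoted in the text
rather than the fitted model object; the function names size_part and effect
are hypothetical.

```r
# rounded coefficients on size, size^2 and size^3 of model (5), as used in the text
b <- c(64.33, -3.42, 0.059)

# part of the predicted test score that depends on the student-teacher ratio
size_part <- function(size) b[1] * size + b[2] * size^2 + b[3] * size^3

# predicted change in test scores when the ratio changes from 'from' to 'to'
effect <- function(from, to) size_part(to) - size_part(from)

effect(20, 18)   # reduction from 20 to 18 students per teacher
effect(22, 20)   # reduction from 22 to 20 students per teacher

# linear specification (2): the effect does not depend on the starting value
-0.73 * (-2)
```

All other regressors cancel in the difference, so only the terms involving size
are needed. With the fitted model at hand, the same differences could be
obtained from predict() applied to two rows of new data that differ only in
size.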


8.5 Exercises 


This interactive part of the book is only available in the HTML version. 


Chapter 9 


Assessing Studies Based on 
Multiple Regression 


The majority of Chapter 9 of the book is of a theoretical nature. Therefore this
chapter briefly reviews the concepts of internal and external validity in general
and discusses examples of threats to internal and external validity of multiple
regression models. We discuss consequences of


• misspecification of the functional form of the regression function
• measurement errors
• missing data and sample selection
• simultaneous causality


as well as sources of inconsistency of OLS standard errors. We also review
concerns regarding internal validity and external validity in the context of
forecasting using regression models.


The chapter closes with an application in R where we assess whether results 
found by multiple regression using the CASchools data can be generalized to 
school districts of another federal state of the United States. 


For a more detailed treatment of these topics we encourage you to work through 
Chapter 9 of the book. 


The following packages and their dependencies are needed for reproduction of 
the code chunks presented throughout this chapter: 


• AER
• mvtnorm
• stargazer

library(AER)
library(mvtnorm)
library(stargazer)


9.1 Internal and External Validity 
Key Concept 9.1 
Internal and External Validity 


A statistical analysis has internal validity if the statistical inferences
made about causal effects are valid for the considered population.


An analysis is said to have external validity if inferences and conclusions
are valid for the study's population and can be generalized to other
populations and settings.





Threats to Internal Validity 
There are two conditions for internal validity to exist: 


1. The estimator of the causal effect, which is measured by the coefficient(s)
of interest, should be unbiased and consistent.


2. Statistical inference is valid, that is, hypothesis tests should have the de- 
sired size and confidence intervals should have the desired coverage prob- 
ability. 


In multiple regression, we estimate the model coefficients using OLS. Thus for 
condition 1. to be fulfilled we need the OLS estimator to be unbiased and 
consistent. For the second condition to be valid, the standard errors must be 
valid such that hypothesis testing and computation of confidence intervals yield 
results that are trustworthy. Remember that a sufficient condition for conditions 
1. and 2. to be fulfilled is that the assumptions of Key Concept 6.4 hold. 
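
The second condition can be made concrete with a small simulation. The
following Monte Carlo sketch is an illustration only: it generates data from a
simple linear model that satisfies the least squares assumptions and checks how
often a 95% confidence interval for the slope covers the true value. The sample
size, the number of replications and the parameter values are assumptions
chosen for this example.

```r
# Monte Carlo sketch: all settings below are illustrative assumptions
set.seed(123)
n_reps <- 1000   # number of simulated samples
n_obs  <- 100    # observations per sample
beta_1 <- 2      # true slope coefficient

covered <- replicate(n_reps, {
  X <- rnorm(n_obs)
  Y <- 1 + beta_1 * X + rnorm(n_obs)   # the least squares assumptions hold
  ci <- confint(lm(Y ~ X), "X", level = 0.95)
  ci[1] <= beta_1 && beta_1 <= ci[2]   # does the interval cover the true slope?
})

# share of intervals that contain the true slope; should be close to 0.95
coverage <- mean(covered)
coverage
```

If the standard errors were invalid, for instance under heteroskedasticity
combined with homoskedasticity-only standard errors, the estimated coverage
could deviate noticeably from the nominal 95%.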


Threats to External Validity 
External validity might be compromised


• if there are differences between the population studied and the population
of interest.


• if there are differences in the settings of the considered populations, e.g.,
the legal framework or the time of the investigation.
9.2 Threats to Internal Validity of Multiple Regression Analysis


This section treats five sources that cause the OLS estimator in (multiple)
regression models to be biased and inconsistent for the causal effect of
interest and discusses possible remedies. All five sources imply a violation of
the first least squares assumption presented in Key Concept 6.4.


This section treats:


• omitted variable bias


• misspecification of the functional form


• measurement errors


• missing data and sample selection


• simultaneous causality bias


Besides these threats to the consistency of the estimator, we also briefly
discuss causes of inconsistent estimation of OLS standard errors.




Omitted Variable Bias 


Key Concept 9.2 
Omitted Variable Bias: Should I include More Variables in My 
Regression? 


Inclusion of additional variables reduces the risk of omitted variable 
bias but may increase the variance of the estimator of the coefficient of 
interest. 


We present some guidelines that help to decide whether to include an
additional variable:


1. Specify the coefficient(s) of interest.


2. Identify the most important potential sources of omitted variable
bias by using knowledge available before estimating the model. You
should end up with a base specification and a set of regressors that
are questionable.


3. Use different model specifications to test whether questionable
regressors have coefficients different from zero.


4. Use tables to provide full disclosure of your results, i.e., present
different model specifications that both support your argument and
enable the reader to see the effect of including questionable
regressors.





By now you should be aware of omitted variable bias and its consequences. Key
Concept 9.2 gives some guidelines on how to proceed if there are control
variables that may help to reduce omitted variable bias. If including additional
variables to mitigate the bias is not an option because there are no adequate
controls, there are different approaches to solve the problem:


• usage of panel data methods (discussed in Chapter 10)
• usage of instrumental variables regression (discussed in Chapter 12)
• usage of a randomized control experiment (discussed in Chapter 13)


Misspecification of the Functional Form of the Regression Function 


If the population regression function is nonlinear but the estimated regression
function is linear, the functional form of the regression model is misspecified.
This leads to a bias of the OLS estimator.
Key Concept 9.3
Functional Form Misspecification


We say a regression suffers from misspecification of the functional form
when the functional form of the estimated regression model differs from
the functional form of the population regression function. Functional
form misspecification leads to biased and inconsistent coefficient
estimators. A way to detect functional form misspecification is to plot the
estimated regression function and the data. This may also be helpful to
choose the correct functional form.





It is easy to come up with examples of misspecification of the functional form: 
consider the case where the population regression function is 

$$Y_i = X_i^2$$

but the model used is 

$$Y_i = \beta_0 + \beta_1 X_i + u_i.$$

Clearly, the regression function is misspecified here. We now simulate data and 
visualize this. 


# set seed for reproducibility 
set.seed(3) 

# simulate data set 
X <- runif(100, -5, 5) 
Y <- X^2 + rnorm(100) 

# estimate the regression function 
ms_mod <- lm(Y ~ X) 
ms_mod 
#> 
#> Call: 
#> lm(formula = Y ~ X) 
#> 
#> Coefficients: 
#> (Intercept)            X 
#>     8.11363     -0.04684 

# plot the data 
plot(X, Y, 
     main = "Misspecification of Functional Form", 
     pch = 20, 
     col = "steelblue") 

# plot the linear regression line 
abline(ms_mod, 
       col = "darkred", 
       lwd = 2) 

[Figure: scatterplot of the simulated data with the estimated linear regression line, titled "Misspecification of Functional Form"] 

It is evident that the regression errors are relatively small for observations close 
to $X = -3$ and $X = 3$ but that the errors increase for X values closer to 
zero and even more for values beyond $-4$ and $4$. Consequences are drastic: 
the intercept is estimated to be 8.1 and for the slope parameter we obtain an 
estimate obviously very close to zero. This issue does not disappear as the 
number of observations is increased because OLS is biased and inconsistent due 
to the misspecification of the regression function. 
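The claim that the problem does not vanish in large samples can be checked with a short simulation. The following is a minimal sketch under the same data-generating process as above; the helper name `fit_linear` and the chosen sample sizes are illustrative assumptions, not part of the original analysis.

```r
# illustrative sketch: the misspecified linear fit does not improve as n grows
set.seed(3)

# hypothetical helper: simulate n observations and fit the (wrong) linear model
fit_linear <- function(n) {
  X <- runif(n, -5, 5)
  Y <- X^2 + rnorm(n)
  coef(lm(Y ~ X))
}

# intercept and slope estimates for increasing sample sizes
res <- sapply(c(100, 10000, 100000), fit_linear)
colnames(res) <- c("n = 100", "n = 10000", "n = 100000")
res

# for comparison: a model with the correct functional form
X <- runif(1000, -5, 5)
Y <- X^2 + rnorm(1000)
quad_mod <- lm(Y ~ I(X^2))
coef(quad_mod)
```

For X uniformly distributed on [-5, 5] we have E(X^2) = 25/3 and Cov(X, X^2) = 0, so the linear fit should settle near an intercept of about 8.33 and a slope of zero regardless of the sample size, whereas the quadratic specification should recover a coefficient on X^2 close to one.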


Measurement Error and Errors-in-Variables Bias 


Key Concept 9.4 
Errors-in-Variables Bias 


When independent variables are measured imprecisely, we speak of 
errors-in-variables bias. This bias does not disappear if the sample size 
is large. If the measurement error has mean zero and is independent of 
the affected variable, the OLS estimator of the respective coefficient is 
biased towards zero. 





Suppose you are incorrectly measuring the single regressor $X_i$ so that there 
is a measurement error and you observe $\tilde{X}_i$ instead of $X_i$. Then, instead of 
estimating the population regression model 

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

you end up estimating 

$$Y_i = \beta_0 + \beta_1 \tilde{X}_i + \underbrace{\beta_1 (X_i - \tilde{X}_i) + u_i}_{= v_i}$$

$$Y_i = \beta_0 + \beta_1 \tilde{X}_i + v_i$$

where $\tilde{X}_i$ and the error term $v_i$ are correlated. Thus OLS would be biased and 
inconsistent for the true $\beta_1$ in this example. One can show that direction and 
strength of the bias depend on the correlation between the observed regressor, 
$\tilde{X}_i$, and the measurement error, $w_i = \tilde{X}_i - X_i$. This correlation in turn depends 
on the type of the measurement error made. 


The classical measurement error model assumes that the measurement error, 
$w_i$, has zero mean and that it is uncorrelated with the variable, $X_i$, and the 
error term of the population regression model, $u_i$: 

$$\tilde{X}_i = X_i + w_i, \quad \rho_{w_i, u_i} = 0, \quad \rho_{w_i, X_i} = 0 \tag{9.1}$$

Then it holds that 

$$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \beta_1 \tag{9.2}$$

which implies inconsistency as $\sigma_X^2, \sigma_w^2 > 0$ such that the fraction in (9.2) is 
smaller than 1. Note that there are two extreme cases: 


1. If there is no measurement error, $\sigma_w^2 = 0$ such that $\hat{\beta}_1 \xrightarrow{p} \beta_1$. 


2. If $\sigma_w^2 \gg \sigma_X^2$ we have $\hat{\beta}_1 \xrightarrow{p} 0$. This is the case if the measurement error is 
so large that there essentially is no information on X in the data that can 
be used to estimate $\beta_1$. 
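The limit in (9.2) follows in a few lines from the assumptions in (9.1). The sketch below additionally assumes that $X_i$ and $u_i$ are uncorrelated, which is implied by the first least squares assumption for the population regression model.

```latex
\begin{align*}
\operatorname{Cov}(\tilde{X}_i, Y_i)
  &= \operatorname{Cov}(X_i + w_i,\ \beta_0 + \beta_1 X_i + u_i)
   = \beta_1 \sigma_X^2, \\
\operatorname{Var}(\tilde{X}_i)
  &= \operatorname{Var}(X_i + w_i)
   = \sigma_X^2 + \sigma_w^2, \\
\hat{\beta}_1
  &\xrightarrow{p} \frac{\operatorname{Cov}(\tilde{X}_i, Y_i)}{\operatorname{Var}(\tilde{X}_i)}
   = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1 .
\end{align*}
```

The first line uses that $w_i$ is uncorrelated with both $X_i$ and $u_i$; the second uses that $w_i$ and $X_i$ are uncorrelated.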


The most obvious way to deal with errors-in-variables bias is to use an accurately 
measured X. If this is not possible, instrumental variables regression is an 
option. One might also deal with the issue by using a mathematical model of 
the measurement error and adjusting the estimates appropriately: if it is plausible 
that the classical measurement error model applies and if there is information 
that can be used to estimate the ratio in equation (9.2), one could compute an 
estimate that corrects for the downward bias. 


For example, consider two bivariate normally distributed random variables X, Y. 
It is a well known result that the conditional expectation function of Y given 
X has the form 

$$E(Y \vert X) = E(Y) + \rho_{X,Y} \frac{\sigma_Y}{\sigma_X} \left[ X - E(X) \right]. \tag{9.3}$$

Thus for 

$$(X, Y) \sim \mathcal{N}\left[ \begin{pmatrix} 50 \\ 100 \end{pmatrix}, \begin{pmatrix} 10 & 5 \\ 5 & 10 \end{pmatrix} \right] \tag{9.4}$$

according to (9.3), the population regression function is 

$$Y_i = 100 + 0.5 (X_i - 50) = 75 + 0.5 X_i.$$
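The step from (9.3) and (9.4) to the population regression function is plain arithmetic. As a small check, the following sketch plugs the moments from (9.4) into (9.3); the variable names are illustrative.

```r
# moments of the bivariate normal distribution in (9.4)
mu_X <- 50
mu_Y <- 100
var_X <- 10
var_Y <- 10
cov_XY <- 5

# correlation and slope according to (9.3)
rho <- cov_XY / sqrt(var_X * var_Y)
slope <- rho * sqrt(var_Y) / sqrt(var_X)

# intercept: E(Y) - slope * E(X)
intercept <- mu_Y - slope * mu_X

c(intercept = intercept, slope = slope)
```

This returns an intercept of 75 and a slope of 0.5, in line with the population regression function stated above.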


Now suppose you gather data on X and Y, but that you can only measure 
$\tilde{X}_i = X_i + w_i$ with $w_i \overset{i.i.d.}{\sim} \mathcal{N}(0, 10)$. Since the $w_i$ are independent of the $X_i$, 
there is no correlation between the $X_i$ and the $w_i$ so that we have a case of the 
classical measurement error model. We now illustrate this example in R using 
the package mvtnorm (Genz et al., 2020). 


# set seed 
set.seed(1) 


# load the package 'mvtnorm' and simulate bivariate normal data 
library(mvtnorm) 
dat <- data.frame( 
  rmvnorm(1000, c(50, 100), 
          sigma = cbind(c(10, 5), c(5, 10)))) 


# set columns names 
colnames(dat) <- c("X", "Y") 


We now estimate a simple linear regression of Y on X using this sample data 
and run the same regression again, but this time with i.i.d. $\mathcal{N}(0, 10)$ errors 
added to X. 


# estimate the model (without measurement error) 
noerror_mod <- lm(Y ~ X, data = dat) 


# estimate the model (with measurement error in X) 
dat$X <- dat$X + rnorm(n = 1000, sd = sqrt(10)) 
error_mod <- lm(Y ~ X, data = dat) 


# print estimated coefficients to console 
noerror_mod$coefficients 
#> (Intercept)           X 
#>  76.3002047   0.4755264 
error_mod$coefficients 
#> (Intercept)           X 
#>  87.2176004   0.255212 

Next, we visualize the results and compare with the population regression function. 

# plot sample data 
plot(dat$X, dat$Y, 
     pch = 20, 
     col = "steelblue", 
     xlab = "X", 
     ylab = "Y") 

# add population regression function 
abline(coef = c(75, 0.5), 
       col = "darkgreen", 
       lwd = 1.5) 

# add estimated regression functions 
abline(noerror_mod, 
       col = "purple", 
       lwd = 1.5) 

abline(error_mod, 
       col = "darkred", 
       lwd = 1.5) 

# add legend 
legend("topleft", 
       bg = "transparent", 
       cex = 0.8, 
       lty = 1, 
       col = c("darkgreen", "purple", "darkred"), 
       legend = c("Population", "No Errors", "Errors")) 

In the situation without measurement error, the estimated regression function 
is close to the population regression function. Things are different when we 
use the mismeasured regressor X: both the estimate for the intercept and the 
estimate for the coefficient on X differ considerably from results obtained using 
the "clean" data on X. In particular $\hat{\beta}_1 = 0.255$, so there is a downward bias. 
We are in the comfortable situation to know $\sigma_X^2$ and $\sigma_w^2$. This allows us to 
correct for the bias using (9.2). Using this information we obtain the 
bias-corrected estimate 

$$\frac{\sigma_X^2 + \sigma_w^2}{\sigma_X^2} \cdot \hat{\beta}_1 = \frac{10 + 10}{10} \cdot 0.255 = 0.51$$

which is quite close to $\beta_1 = 0.5$, the true coefficient from the population 
regression function. 


Bear in mind that the above analysis uses a single sample. Thus one may argue 
that the results are just a coincidence. Can you show the contrary using a 
simulation study? 
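One possible way to approach this question is a small Monte Carlo experiment. The sketch below is only one of many ways to set it up: it generates the data in base R (a regression representation that is equivalent to the bivariate normal in (9.4)) instead of using mvtnorm, and the number of replications, the seed and all object names are arbitrary choices.

```r
# illustrative Monte Carlo: attenuation bias under classical measurement error
set.seed(123)

reps <- 1000  # number of simulated samples (arbitrary)
n <- 1000     # observations per sample

sim_once <- function() {
  # X ~ N(50, 10) and Y = 75 + 0.5 X + e with Var(e) = 7.5, so that
  # Var(Y) = 10 and Cov(X, Y) = 5 as in (9.4)
  X <- rnorm(n, mean = 50, sd = sqrt(10))
  Y <- 75 + 0.5 * X + rnorm(n, sd = sqrt(7.5))

  # mismeasured regressor with classical N(0, 10) measurement error
  X_tilde <- X + rnorm(n, sd = sqrt(10))

  c(clean = unname(coef(lm(Y ~ X))[2]),
    error = unname(coef(lm(Y ~ X_tilde))[2]))
}

est <- replicate(reps, sim_once())

# average slope estimates across all samples
rowMeans(est)

# average of the bias-corrected estimates, using (9.2) with known variances
corrected <- est["error", ] * (10 + 10) / 10
mean(corrected)
```

If the classical measurement error model holds, the slope estimates based on the mismeasured regressor should center around 0.25 rather than 0.5, and the corrected estimates around 0.5, which indicates that the result from the single sample was not a coincidence.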


Missing Data and Sample Selection 


Key Concept 9.5 
Sample Selection Bias 


When the sampling process influences the availability of data and when 
there is a relation of this sampling process to the dependent variable 
that goes beyond the dependence on the regressors, we say that there is 
a sample selection bias. This bias is due to correlation between one or 
more regressors and the error term. Sample selection implies both bias 
and inconsistency of the OLS estimator. 





There are three cases of sample selection. Only one of them poses a threat to 
internal validity of a regression study. The three cases are: 


1. Data are missing at random. 
2. Data are missing based on the value of a regressor. 
3. Data are missing due to a selection process which is related to the 
   dependent variable. 


Let us jump back to the example of variables X and Y distributed as stated in 
equation (9.4) and illustrate all three cases using R. 


If data are missing at random, this is nothing but losing observations. For 
example, losing 50% of the sample would be the same as never having seen the 
(randomly chosen) half of the sample observed. Therefore, missing data do not 
introduce an estimation bias and "only" lead to less efficient estimators. 


# set seed 
set.seed(1) 


# simulate data 
dat <- data.frame( 
rmvnorm(1000, c(50, 100), 
sigma = cbind(c(10, 5), c(5, 10)))) 


colnames(dat) <- c("X", "Y") 


# mark 500 randomly selected observations 
id <- sample(1:1000, size = 500) 

# plot the retained observations 
plot(dat$X[-id], 
     dat$Y[-id], 
     col = "steelblue", 
     pch = 20, 
     cex = 0.8, 
     xlab = "X", 
     ylab = "Y") 

# add the discarded observations in gray 
points(dat$X[id], 
       dat$Y[id], 
       cex = 0.8, 
       col = "gray", 
       pch = 20) 

# add the population regression function 
abline(coef = c(75, 0.5), 
       col = "darkgreen", 
       lwd = 1.5) 

# add the estimated regression function for the full sample 
abline(noerror_mod) 

# estimate model case 1 and add the regression line 
dat <- dat[-id, ] 
c1_mod <- lm(Y ~ X, data = dat) 
abline(c1_mod, col = "purple") 

# add a legend 
legend("topleft", 
       lty = 1, 
       bg = "transparent", 
       cex = 0.8, 
       col = c("darkgreen", "black", "purple"), 
       legend = c("Population", "Full sample", "500 obs. randomly selected")) 

The gray dots represent the 500 discarded observations. When using the 
remaining observations, the estimation results deviate only marginally from the 
results obtained using the full sample. 


Selecting data based on the value of a regressor also has the effect of 
reducing the sample size and does not introduce estimation bias. We will now 
drop all observations with X >= 45, estimate the model again and compare. 


# set random seed 
set.seed(1) 


# simulate data 
dat <- data.frame( 
rmvnorm(1000, c(50, 100), 
sigma = cbind(c(10, 5), c(5, 10)))) 


colnames(dat) <- c("X", "Y") 


# mark observations to be dropped 
id <- dat$X >= 45 

# plot the retained observations (axis limits cover the full sample) 
plot(dat$X[!id], 
     dat$Y[!id], 
     col = "steelblue", 
     cex = 0.8, 
     pch = 20, 
     xlim = range(dat$X), 
     ylim = range(dat$Y), 
     xlab = "X", 
     ylab = "Y") 

# add the dropped observations in gray 
points(dat$X[id], 
       dat$Y[id], 
       col = "gray", 
       cex = 0.8, 
       pch = 20) 

# add population regression function 
abline(coef = c(75, 0.5), 
       col = "darkgreen", 
       lwd = 1.5) 

# add estimated regression function for full sample 
abline(noerror_mod) 

# estimate model case 2, add regression line 
dat <- dat[!id, ] 
c2_mod <- lm(Y ~ X, data = dat) 
abline(c2_mod, col = "purple") 

# add legend 
legend("topleft", 
       lty = 1, 
       bg = "transparent", 
       cex = 0.8, 
       col = c("darkgreen", "black", "purple"), 
       legend = c("Population", "Full sample", "Obs. with X < 45")) 

Note that although we dropped more than 90% of all observations, selection on 
the value of the regressor alone does not bias the OLS estimator: the estimated 
regression function should not deviate systematically from the line estimated 
based on the full sample, although it is estimated less precisely. 


In the third case we face sample selection bias. We can illustrate this by using 
only observations with $X_i \leq 55$ and $Y_i \geq 100$. These observations are easily 
identified using the function which() and logical operators: 
which(dat$X <= 55 & dat$Y >= 100) 


# set random seed 
set.seed(1) 


# simulate data 
dat <- data.frame( 
rmvnorm(1000, c(50,100), 
sigma = cbind(c(10,5), c(5,10)))) 


colnames(dat) <- c("X", "Y") 

# mark observations to be retained 
id <- which(dat$X <= 55 & dat$Y >= 100) 

# plot the dropped observations in gray 
plot(dat$X[-id], 
     dat$Y[-id], 
     col = "gray", 
     cex = 0.8, 
     pch = 20, 
     xlab = "X", 
     ylab = "Y") 

# add the retained observations 
points(dat$X[id], 
       dat$Y[id], 
       col = "steelblue", 
       cex = 0.8, 
       pch = 20) 

# add population regression function 
abline(coef = c(75, 0.5), 
       col = "darkgreen", 
       lwd = 1.5) 

# add estimated regression function for full sample 
abline(noerror_mod) 

# estimate model case 3, add regression line 
dat <- dat[id, ] 
c3_mod <- lm(Y ~ X, data = dat) 
abline(c3_mod, col = "purple") 

# add legend 
legend("topleft", 
       lty = 1, 
       bg = "transparent", 
       cex = 0.8, 
       col = c("darkgreen", "black", "purple"), 
       legend = c("Population", "Full sample", "X <= 55 & Y >= 100")) 

We see that the selection process leads to biased estimation results. 


There are methods that allow correcting for sample selection bias. However, 
these methods are beyond the scope of the book and are therefore not considered 
here. The concept of sample selection bias is summarized in Key Concept 9.5. 
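Since each of the three illustrations above rests on a single sample, it can be instructive to repeat them many times. The following Monte Carlo sketch does so in base R, using a regression representation of the distribution in (9.4) rather than mvtnorm; the number of replications, the seed and the helper names are illustrative assumptions.

```r
# illustrative Monte Carlo for the three cases of sample selection
set.seed(42)

reps <- 500  # number of simulated samples (arbitrary)
n <- 1000    # observations per sample before selection

# hypothetical helper: OLS slope of a regression of Y on X
slope <- function(X, Y) unname(coef(lm(Y ~ X))[2])

sim_once <- function() {
  # data with the same first and second moments as in (9.4)
  X <- rnorm(n, mean = 50, sd = sqrt(10))
  Y <- 75 + 0.5 * X + rnorm(n, sd = sqrt(7.5))

  keep1 <- sample(n, n / 2)            # case 1: missing at random
  keep2 <- which(X < 45)               # case 2: selection on the regressor
  keep3 <- which(X <= 55 & Y >= 100)   # case 3: selection related to Y

  c(case1 = slope(X[keep1], Y[keep1]),
    case2 = slope(X[keep2], Y[keep2]),
    case3 = slope(X[keep3], Y[keep3]))
}

est <- replicate(reps, sim_once())

# mean and standard deviation of the slope estimates by case
rowMeans(est)
apply(est, 1, sd)
```

Under these assumptions the average slope estimates in cases 1 and 2 should be close to the true value 0.5, with case 2 showing a much larger spread because few observations remain, while the estimates in case 3 should lie systematically below 0.5.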


Simultaneous Causality 


Key Concept 9.6 
Simultaneous Causality Bias 


So far we have assumed that the changes in the independent variable 
X are responsible for changes in the dependent variable Y. When the 
reverse is also true, we say that there is simultaneous causality between 
X and Y. This reverse causality leads to correlation between X and the 
error in the population regression of interest such that the coefficient on 
X is estimated with a bias. 





Suppose we are interested in estimating the effect of a 20% increase in cigarette 
prices on cigarette consumption in the United States using a multiple regression 
model. This may be investigated using the dataset CigarettesSW which 
is part of the AER package. CigarettesSW is a panel data set on cigarette 
consumption for all 48 continental U.S. federal states, observed in 1985 and 1995, 
and provides data on economic indicators and average local prices, taxes and per 
capita pack consumption. 


After loading the data set, we pick observations for the year 1995, estimate a 
simple linear regression of the logarithm of per capita pack consumption, packs, 
on the logarithm of the per pack price, price, and plot the data together with 
the estimated regression line. 


# load the data set 
library(AER) 
data("CigarettesSW") 
c1995 <- subset(CigarettesSW, year == "1995") 

# estimate the model 
cigcon_mod <- lm(log(packs) ~ log(price), data = c1995) 
cigcon_mod 

#> 
#> Call: 
#> lm(formula = log(packs) ~ log(price), data = c1995) 
#> 
#> Coefficients: 
#> (Intercept)   log(price) 
#>      10.850       -1.213 


# plot the estimated regression line and the data 
plot(log(c1995$price), log(c1995$packs), 
     xlab = "ln(Price)", 
     ylab = "ln(Consumption)", 
     main = "Demand for Cigarettes", 
     pch = 20, 
     col = "steelblue") 

abline(cigcon_mod, 
       col = "darkred", 
       lwd = 1.5) 

[Figure: scatterplot of ln(Consumption) against ln(Price) with the estimated regression line, titled "Demand for Cigarettes"] 

Remember from Chapter 8 that, due to the log-log specification, in the population 
regression the coefficient on the logarithm of price is interpreted as the 
price elasticity of consumption. The estimated coefficient suggests that a 1% 
increase in cigarette prices reduces cigarette consumption by about 1.2%, on 
average. Have we estimated a demand curve? The answer is no: this is a classic 
example of simultaneous causality, see Key Concept 9.6. The observations are 
market equilibria which are determined by both changes in supply and changes 
in demand. Therefore the price is correlated with the error term and the OLS 
estimator is biased. We can neither estimate a demand nor a supply curve 
consistently using this approach. 


We will return to this issue in Chapter 12 which treats instrumental variables 
regression, an approach that allows consistent estimation when there is simul- 
taneous causality. 
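To see why equilibrium data identify neither curve, consider a stylized market. The sketch below is purely hypothetical: the elasticities, intercepts and shock distributions are made-up values chosen for illustration and are not related to the cigarette data.

```r
# illustrative simulation of simultaneous causality in a log-linear market
set.seed(10)
n <- 10000

beta_d <- -1.2  # assumed price elasticity of demand
beta_s <- 0.8   # assumed price elasticity of supply

u <- rnorm(n)   # demand shocks
v <- rnorm(n)   # supply shocks

# demand: log_q = 10 + beta_d * log_p + u
# supply: log_q =  2 + beta_s * log_p + v
# equating both curves and solving for the equilibrium price:
log_p <- (10 - 2 + u - v) / (beta_s - beta_d)
log_q <- 10 + beta_d * log_p + u

# OLS regression of equilibrium quantities on equilibrium prices
eq_mod <- lm(log_q ~ log_p)
coef(eq_mod)
```

Because the equilibrium price depends on the demand shock u, the regressor is correlated with the error term of the demand equation. With equal shock variances, as assumed here, the OLS slope converges to the average of the two elasticities, (-1.2 + 0.8) / 2 = -0.2, which matches neither the demand nor the supply elasticity.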


Sources of Inconsistency of OLS Standard Errors 


There are two central threats to computation of consistent OLS standard errors: 


1. Heteroskedasticity: implications of heteroskedasticity have been discussed 
in Chapter 5. Heteroskedasticity-robust standard errors as computed by 
the function vcovHC() from the package sandwich produce valid standard 
errors under heteroskedasticity. 


2. Serial correlation: if the population regression error is correlated across 
observations, we have serial correlation. This often happens in applications 
where repeated observations are used, e.g., in panel data studies. 
As for heteroskedasticity, robust covariance estimators are available: for 
panel models estimated with the package plm, the corresponding vcovHC() 
method yields standard errors that remain valid when the errors are 
serially correlated within entities. 

Inconsistent standard errors will produce invalid hypothesis tests and wrong 
confidence intervals. For example, when testing the null that some model 
coefficient is zero, we cannot trust the outcome anymore because the test may fail 
to have a size of 5% due to the wrongly computed standard error. 
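The effect on the size of a test can be made visible with a small simulation. The sketch below uses an assumed, deliberately strong form of heteroskedasticity and computes the HC1 covariance matrix by hand in base R, so that no additional packages are needed; this is the estimator that vcovHC() with type = "HC1" implements. All names and settings are illustrative.

```r
# illustrative simulation: size of a t-test under heteroskedasticity
set.seed(7)

reps <- 2000  # number of simulated samples (arbitrary)
n <- 200      # observations per sample

reject <- replicate(reps, {
  X <- rnorm(n)
  u <- rnorm(n, sd = abs(X))  # error standard deviation depends on X
  Y <- 1 + 0 * X + u          # the true slope is zero

  mod <- lm(Y ~ X)
  b1 <- coef(mod)[2]

  # homoskedasticity-only standard error
  se_homo <- sqrt(diag(vcov(mod)))[2]

  # HC1 standard error: n / (n - k) * (X'X)^{-1} [sum e_i^2 x_i x_i'] (X'X)^{-1}
  Xmat <- model.matrix(mod)
  e <- resid(mod)
  bread <- solve(crossprod(Xmat))
  meat <- crossprod(Xmat * e)
  se_hc1 <- sqrt(diag(n / (n - 2) * bread %*% meat %*% bread))[2]

  # does a two-sided test at the nominal 5% level reject the true null?
  c(homo = unname(abs(b1 / se_homo) > 1.96),
    robust = unname(abs(b1 / se_hc1) > 1.96))
})

# empirical rejection frequencies
rowMeans(reject)
```

In this design the test based on homoskedasticity-only standard errors should reject the true null hypothesis far more often than 5% of the time, whereas the rejection frequency of the robust test should be close to the nominal level.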


Key Concept 9.7 summarizes all threats to internal validity discussed above. 


Key Concept 9.7 
Threats to Internal Validity of a Regression Study 


The five primary threats to internal validity of a multiple regression 
study are: 


1. Omitted variables 


2. Misspecification of functional form 


3. Errors in variables (measurement errors in the regressors) 


4. Sample selection 


5. Simultaneous causality 


All these threats lead to failure of the first least squares assumption, i.e., 
$$E(u_i \vert X_{1i}, \dots, X_{ki}) \neq 0$$ 
so that the OLS estimator is biased and inconsistent. 


Furthermore, if one does not adjust for heteroskedasticity and/or serial 
correlation, incorrect standard errors may be a threat to internal validity 
of the study. 





9.3 Internal and External Validity when the Regression is Used for Forecasting 


Recall the regression of test scores on the student-teacher ratio (STR) performed 
in Chapter 4: 


linear_model <- lm(score ~ STR, data = CASchools) 
linear_model 
#> 
#> Call: 
#> lm(formula = score ~ STR, data = CASchools) 
#> 
#> Coefficients: 
#> (Intercept)          STR 
#>      698.93        -2.28 


The estimated regression function was 


$$\widehat{TestScore} = 698.9 - 2.28 \times STR.$$


The book discusses the example of a parent moving to a metropolitan area who 
plans to choose where to live based on the quality of local schools: a school 
district's average test score is an adequate measure for the quality. However, the 
parent has information on the student-teacher ratio only, such that test scores 
need to be predicted. Although we have established that there is omitted variable 
bias in this model due to omission of variables like student learning 
opportunities outside school, the share of English learners and so on, linear_model 
may in fact be useful for the parent: 


The parent need not care whether the coefficient on STR has a causal interpretation; 
she wants STR to explain as much variation in test scores as possible. Therefore, 
despite the fact that linear_model cannot be used to estimate the causal effect 
of a change in STR on test scores, it can be considered a reliable predictor of 
test scores in general. 


Thus, the threats to internal validity as summarized in Key Concept 9.7 are 
negligible for the parent. As discussed in the book, this is different for a 
superintendent who has been tasked to take measures that increase test scores: 
she requires a more reliable model that does not suffer from the threats listed 
in Key Concept 9.7. 
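For the parent's purpose, the estimated regression function is simply used as a prediction rule. A minimal sketch, using the rounded coefficients from the output above; the function name `predict_score` and the chosen student-teacher ratios are illustrative.

```r
# rounded coefficient estimates taken from the output of linear_model
b0 <- 698.93
b1 <- -2.28

# hypothetical helper: predicted district test score for a given STR
predict_score <- function(STR) b0 + b1 * STR

# predictions for districts with 18, 20 and 22 students per teacher
predict_score(c(18, 20, 22))
```

For a district with 20 students per teacher this gives a predicted score of about 653.3. If the fitted object is available, predict(linear_model, newdata = data.frame(STR = 20)) produces the corresponding prediction from the unrounded coefficients.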


Consult Chapter 9.3 of the book for the corresponding discussion. 


9.4 Example: Test Scores and Class Size 


This section discusses internal and external validity of the results gained from 
analyzing the California test score data using multiple regression models. 


External Validity of the Study 


External validity of the California test score analysis means that its results can 
be generalized. Whether this is possible depends on the population and the 
setting. Following the book we conduct the same analysis using data for fourth 
graders in 220 public school districts in Massachusetts in 1998. Like CASchools, 
the data set MASchools is part of the AER package (Kleiber and Zeileis, 2020). 
Use the help function (?MASchools) to get information on the definitions of all 
the variables contained. 


We start by loading the data set and proceed by computing some summary 
statistics. 

# attach the 'MASchools' dataset 
data("MASchools") 
summary(MASchools) 

#> district municipality expreg expspecial 
#> Length:220 Length: 220 Min. 12905 = Min. 1 8832 
#> Class :character Class :character ist Qu.:4065 1st Qu.: 7442 
#> Mode :character Mode :character Median :4488 Median : 8354 
#> Mean :4605 Mean : 8901 
#> 3rd Qu. :4972 Sna (Nae Lee 
#> Maz. 28759 Mac: 153569 
#> 

#> expbil expocc exptot scratto 

#> Min. ; O Min. z O Min. 13465 Min. 12.300 

#> Tst Qu.: 0 Tst Qu.: 0 ist Qu: 4730 LSC QU On LOO 

#> Median : O Median : O Median :5155 Median : 7.800 

#> Mean 3037 Mean el LOL ee Mean :5370 Mean T8- sHOG/ 

#> 3ra Qu. 0 ENAC O iS 0 3ra Qu -57/89 Sand une 97800 

#> Maz. 1295140 Mar. 1508S) Mant 29868 Maz. 18.400 

#> NA's 29 

#> special lunch stratio income 

#> Min. SOO Mon 7 0.40 Min. IO Mon: : 9.686 

Ho Ist Qu. 13738 R Aa E ee) Siene Aig ikek 10) ISt QU e 15223 

#> Median :15.45 Median :10.55 Median :17.10 Median :17 128 

#> Mean 115.97 Mean £15.32 Mean 17.34 Mean 185747 

#> 3rd Qu.::17-93 3rd Qu 20102 ond Gusc19302 Sra Qu: 20376 

#> Maz. 134.30 Maz. 5 Oc se Mans 127.00 Maz. 46.855 

#> 

#> score4 score& salary english 

#> Min. 7658.0 Min. [641.0 Min. 124.96 Min : 0.0000 
HE ISt QU SO lO, Tst Qu 1685.40 Tot OU Soret sh) Tst Qu- e 070000 
#> Median :711.0 Median :698.0 Median :35.88 Median : 0.0000 
#> Mean :709.8 Mean [698.4 Mean 135.99 Mean LLT 

cee Irda Qu. 172010 gra Qu T1210 sra Qu -o 96ra Qu 08859 

#> Maz. 3140.0 Maz. FCM sO) Juliane [44.49 Maz. 24.4939 

#> NA's :40 NA's 125 


It is fairly easy to replicate key components of Table 9.1 of the book using R. To 
be consistent with variable names used in the CASchools data set, we do some 
formatting beforehand. 

# Customized variables in MASchools 
MASchools$score <- MASchools$score4 
MASchools$STR <- MASchools$stratio 

# Reproduce Table 9.1 of the book 
vars <- c("score", "STR", "english", "lunch", "income") 

cbind(CA_mean = sapply(CASchools[, vars], mean), 
      CA_sd = sapply(CASchools[, vars], sd), 
      MA_mean = sapply(MASchools[, vars], mean), 
      MA_sd = sapply(MASchools[, vars], sd)) 
#>           CA_mean     CA_sd    MA_mean     MA_sd 
#> score   654.15655 19.053347 709.827273 15.126474 
#> STR      19.64043  1.891812  17.344091  2.276666 
#> english  15.76816 18.285927   1.117676  2.900940 
#> lunch    44.70524 27.123381  15.315909 15.060068 
#> income   15.31659  7.225890  18.746764  5.807637 


The summary statistics reveal that the average test score is higher for school 
districts in Massachusetts. The test used in Massachusetts is somewhat differ- 
ent from the one used in California (the Massachusetts test score also includes 
results for the school subject “Science”), therefore a direct comparison of test 
scores is not appropriate. We also see that, on average, classes are smaller in 
Massachusetts than in California and that the average district income, average 
percentage of English learners as well as the average share of students receiving 
subsidized lunch differ considerably from the averages computed for California. 
There are also notable differences in the observed dispersion of the variables. 


Following the book we examine the relationship between district income and test 
scores in Massachusetts as we have done before in Chapter 8 for the California 
data and reproduce Figure 9.2 of the book. 


# estimate linear model 
Linear_model_MA <- lm(score ~ income, data = MASchools) 
Linear_model_MA 
#> 
#> Call: 
#> lm(formula = score ~ income, data = MASchools) 
#> 
#> Coefficients: 
#> (Intercept)       income 
#>     679.387        1.624 

# estimate linear-log model 
Linearlog_model_MA <- lm(score ~ log(income), data = MASchools) 
Linearlog_model_MA 

#> 

#> Call: 

#> lm(formula = score ~ log(income), data = MASchools) 
#> 

#> Coefficients: 
#> (Intercept)  log(income) 
#>      600.80        37.71 


# estimate cubic model 
cubic_model_MA <- lm(score ~ I(income) + I(income^2) + I(income^3), data = MASchools) 
cubic_model_MA 
#> 
#> Call: 
#> lm(formula = score ~ I(income) + I(income^2) + I(income^3), data = MASchools) 
#> 
#> Coefficients: 
#> (Intercept)    I(income)  I(income^2)  I(income^3) 
#>  600.398531    10.635382    -0.296887     0.002762 

# plot data 
plot(MASchools$income, MASchools$score, 
     pch = 20, 
     col = "steelblue", 
     xlab = "District income", 
     ylab = "Test score", 
     xlim = c(0, 50), 
     ylim = c(620, 780)) 

# add estimated regression line for the linear model 
abline(Linear_model_MA, lwd = 2) 

# add estimated regression function for linear-log model 
order_id <- order(MASchools$income) 

lines(MASchools$income[order_id], 
      fitted(Linearlog_model_MA)[order_id], 
      col = "darkgreen", 
      lwd = 2) 

# add estimated cubic regression function 
lines(x = MASchools$income[order_id], 
      y = fitted(cubic_model_MA)[order_id], 
      col = "orange", 
      lwd = 2) 

# add a legend 
legend("topleft", 
       legend = c("Linear", "Linear-Log", "Cubic"), 
       lty = 1, 
       col = c("black", "darkgreen", "orange")) 

The plot indicates that the cubic specification fits the data best. Interestingly, 
this is different from the CASchools data where the pattern of nonlinearity is 
better described by the linear-log specification. 


We continue by estimating most of the model specifications used for analysis of 
the CASchools data set in Chapter 8 and use stargazer() (Hlavac, 2018) to 
generate a tabular representation of the regression results. 


# add 'HiEL' to 'MASchools' 
MASchools$HiEL <- as.numeric(MASchools$english > median(MASchools$english)) 

# estimate the model specifications from Table 9.2 of the book 
TestScore_MA_mod1 <- lm(score ~ STR, data = MASchools) 

TestScore_MA_mod2 <- lm(score ~ STR + english + lunch + log(income), 
                        data = MASchools) 

TestScore_MA_mod3 <- lm(score ~ STR + english + lunch + income + I(income^2) 
                        + I(income^3), data = MASchools) 

TestScore_MA_mod4 <- lm(score ~ STR + I(STR^2) + I(STR^3) + english + lunch + income 
                        + I(income^2) + I(income^3), data = MASchools) 

TestScore_MA_mod5 <- lm(score ~ STR + HiEL + HiEL:STR + lunch + income + I(income^2) 
                        + I(income^3), data = MASchools) 

TestScore_MA_mod6 <- lm(score ~ STR + lunch + income + I(income^2) + I(income^3), 
                        data = MASchools) 


# gather robust standard errors 
rob_se <- list(sqrt(diag(vcovHC(TestScore_MA_mod1, type = "HC1"))), 
               sqrt(diag(vcovHC(TestScore_MA_mod2, type = "HC1"))), 
               sqrt(diag(vcovHC(TestScore_MA_mod3, type = "HC1"))), 
               sqrt(diag(vcovHC(TestScore_MA_mod4, type = "HC1"))), 
               sqrt(diag(vcovHC(TestScore_MA_mod5, type = "HC1"))), 
               sqrt(diag(vcovHC(TestScore_MA_mod6, type = "HC1")))) 


# generate a table with 'stargazer()' 
library(stargazer) 

stargazer(TestScore_MA_mod1, TestScore_MA_mod2, TestScore_MA_mod3, 
          TestScore_MA_mod4, TestScore_MA_mod5, TestScore_MA_mod6, 
          title = "Regressions Using Massachusetts Test Score Data", 
          type = "latex", 
          digits = 3, 
          header = FALSE, 
          se = rob_se, 
          object.names = TRUE, 
          model.numbers = FALSE, 
          column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)", "(VI)")) 




Next we reproduce the F-statistics and p-values for testing exclusion of groups 
of variables. 


# F-test model (3) 
linearHypothesis(TestScore_MA_mod3, 
                 c("I(income^2)=0", "I(income^3)=0"), 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> I(income^2) = 0 
#> I(income^3) = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ STR + english + lunch + income + I(income^2) + I(income^3) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df     F   Pr(>F) 
#> 1    215 
#> 2    213  2 6.227 0.002354 ** 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


# F-tests model (4) 
linearHypothesis(TestScore_MA_mod4, 
                 c("STR=0", "I(STR^2)=0", "I(STR^3)=0"), 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> STR = 0 
#> I(STR^2) = 0 
#> I(STR^3) = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ STR + I(STR^2) + I(STR^3) + english + lunch + income + 
#>     I(income^2) + I(income^3) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F  Pr(>F) 
#> 1    214 
#> 2    211  3 2.3364 0.07478 . 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


linearHypothesis(TestScore_MA_mod4, 
                 c("I(STR^2)=0", "I(STR^3)=0"), 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> I(STR^2) = 0 
#> I(STR^3) = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ STR + I(STR^2) + I(STR^3) + english + lunch + income + 
#>     I(income^2) + I(income^3) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F Pr(>F) 
#> 1    213 
#> 2    211  2 0.3396 0.7124 


linearHypothesis(TestScore_MA_mod4, 
                 c("I(income^2)=0", "I(income^3)=0"), 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> I(income^2) = 0 
#> I(income^3) = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ STR + I(STR^2) + I(STR^3) + english + lunch + income + 
#>     I(income^2) + I(income^3) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F   Pr(>F) 
#> 1    213 
#> 2    211  2 5.7043 0.003866 ** 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


# F-tests model (5) 
linearHypothesis(TestScore_MA_mod5, 
                 c("STR=0", "STR:HiEL=0"), 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> STR = 0 
#> STR:HiEL = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ STR + HiEL + HiEL:STR + lunch + income + I(income^2) + 
#>     I(income^3) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F Pr(>F) 
#> 1    214 
#> 2    212  2 3.7663 0.0247 * 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


linearHypothesis(TestScore_MA_mod5, 
                 c("I(income^2)=0", "I(income^3)=0"), 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> I(income^2) = 0 
#> I(income^3) = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ STR + HiEL + HiEL:STR + lunch + income + I(income^2) + 
#>     I(income^3) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F  Pr(>F) 
#> 1    214 
#> 2    212  2 3.2201 0.04191 * 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


linearHypothesis(TestScore_MA_mod5, 
                 c("HiEL=0", "STR:HiEL=0"), 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> HiEL = 0 
#> STR:HiEL = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ STR + HiEL + HiEL:STR + lunch + income + I(income^2) + 
#>     I(income^3) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F Pr(>F) 
#> 1    214 
#> 2    212  2 1.4674 0.2328 


# F-test model (6) 
linearHypothesis(TestScore_MA_mod6, 
                 c("I(income^2)=0", "I(income^3)=0"), 
                 vcov. = vcovHC, type = "HC1") 
#> Linear hypothesis test 
#> 
#> Hypothesis: 
#> I(income^2) = 0 
#> I(income^3) = 0 
#> 
#> Model 1: restricted model 
#> Model 2: score ~ STR + lunch + income + I(income^2) + I(income^3) 
#> 
#> Note: Coefficient covariance matrix supplied. 
#> 
#>   Res.Df Df      F  Pr(>F) 
#> 1    216 
#> 2    214  2 4.2776 0.01508 * 
#> --- 
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


We see that, in terms of $\bar{R}^2$, specification (3) which uses a cubic to
model the relationship between district income and test scores indeed performs
better than the linear-log specification (2). Using different F-tests on models
(4) and (5), we cannot reject the hypothesis that there is no nonlinear
relationship between the student-teacher ratio and test scores, nor the
hypothesis that the share of English learners has no influence on the
relationship of interest. Furthermore, regression (6) shows that the percentage
of English learners can be omitted as a regressor. Because the model
specifications (4) to (6) do not lead to substantially different results than
those of regression (3), we choose model (3) as the most suitable specification.


In comparison to the California data, we observe the following results: 


1. Controlling for the students’ background characteristics in model speci- 
fication (2) reduces the coefficient of interest (student-teacher ratio) by 
roughly 60%. The estimated coefficients are close to each other. 


2. The coefficient on student-teacher ratio is always significantly different 
from zero at the level of 1% for both data sets. This holds for all considered 
model specifications in both studies. 


3. In both studies the share of English learners in a school district is of little 
importance for the estimated impact of a change in the student-teacher 
ratio on test score. 


The biggest difference is that, in contrast to the California results, we do not
find evidence of a nonlinear relationship between test scores and the
student-teacher ratio for the Massachusetts data since the corresponding F-tests
for model (4) do not reject.


As pointed out in the book, the test scores for California and Massachusetts have 
different units because the underlying tests are different. Thus the estimated 
coefficients on the student-teacher ratio in both regressions cannot be compared 
before standardizing test scores to the same units as 


$$\frac{TestScore - \overline{TestScore}}{\sigma_{TestScore}}$$


for all observations in both data sets and running the regressions of interest 
using the standardized data again. One can show that the coefficient on student- 
teacher ratio in the regression using standardized test scores is the coefficient of 
the original regression divided by the standard deviation of test scores. 
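
The rescaling claim above can be checked numerically. The following is a minimal sketch on simulated data, not on the book's data sets; the names str_sim, score_sim, mod_raw and mod_std are hypothetical and chosen only for this illustration.

```r
# simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 200
str_sim   <- rnorm(n, mean = 20, sd = 2)              # student-teacher ratio
score_sim <- 700 - 2 * str_sim + rnorm(n, sd = 10)    # test score

# regression using test scores in their original units
mod_raw <- lm(score_sim ~ str_sim)

# regression using standardized test scores
score_std <- (score_sim - mean(score_sim)) / sd(score_sim)
mod_std   <- lm(score_std ~ str_sim)

# slope of the standardized regression vs. original slope divided by sd
coef_std      <- unname(coef(mod_std)[2])
coef_rescaled <- unname(coef(mod_raw)[2] / sd(score_sim))
all.equal(coef_std, coef_rescaled)
```

Because standardizing is a linear transformation of the dependent variable, both numbers agree up to numerical precision, so it is not necessary to re-estimate the model on standardized data.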


For model (3) of the Massachusetts data, the estimated coefficient on the
student-teacher ratio is $-0.64$. A reduction of the student-teacher ratio by
two students is predicted to increase test scores by $-2 \cdot (-0.64) = 1.28$
points. Thus we can compute the effect of a reduction of the student-teacher
ratio by two students on the standardized test scores as follows:


TestScore_MA_mod3$coefficients[2] / sd(MASchools$score) * (-2)
#> STR 
#> 0.08474001 


For Massachusetts the predicted increase of test scores due to a reduction of
the student-teacher ratio by two students is 0.085 standard deviations of the
observed distribution of test scores.


Using the linear specification (2) for California, the estimated coefficient on
the student-teacher ratio is $-0.73$ so the predicted increase of test scores
induced by a reduction of the student-teacher ratio by two students is
$-0.73 \cdot (-2) = 1.46$. We use R to compute the predicted change in standard
deviation units:


TestScore_mod2$coefficients[2] / sd(CASchools$score) * (-2) 
#> STR 
#> 0.07708103 


This shows that the predicted increase of test scores due to a reduction
of the student-teacher ratio by two students is 0.077 standard deviations of the
observed distribution of test scores for the California data.


In terms of standardized test scores, the predicted change is essentially the same 
for school districts in California and Massachusetts. 


Altogether, the results support external validity of the inferences made using
data on Californian elementary school districts, at least for Massachusetts.


Internal Validity of the Study


External validity of the study does not ensure its internal validity. Although
the chosen model specification improves upon a simple linear regression model,
internal validity may still be violated due to some of the threats listed in Key
Concept 9.7. These threats are:


• omitted variable bias
• misspecification of functional form
• errors in variables
• sample selection issues
• simultaneous causality
• heteroskedasticity
• correlation of errors across observations


Consult the book for an in-depth discussion of these threats in view of both test 
score studies. 


Summary 


We have found that there is a small but statistically significant effect of the
student-teacher ratio on test scores. However, it remains unclear whether we
have indeed estimated the causal effect of interest: although our approach
includes control variables, takes into account nonlinearities in the population
regression function and bases statistical inference on robust standard errors,
the results might still be biased, for example if there are omitted factors
which we have not considered. Thus internal validity of the study remains
questionable. As we have concluded from comparison with the analysis of the
Massachusetts data set, this result may be externally valid.


The following chapters address techniques that can be remedies to all the threats 
to internal validity listed in Key Concept 9.7 if multiple regression alone is insuf- 
ficient. This includes regression using panel data and approaches that employ 
instrumental variables. 


9.5 Exercises 


This interactive part of the book is only available in the HTML version. 




Chapter 10 


Regression with Panel Data 


Regression using panel data may mitigate omitted variable bias when there is
no information on variables that correlate with both the regressors of interest
and the dependent variable and if these variables are constant in the time
dimension or across entities. Provided that panel data is available, panel
regression methods may improve upon multiple regression models which, as
discussed in Chapter 9, produce results that are not internally valid in such a
setting.


This chapter covers the following topics: 


• notation for panel data
• fixed effects regression using time and/or entity fixed effects
• computation of standard errors in fixed effects regression models


Following the book, for applications we make use of the dataset Fatalities 
from the AER package (Kleiber and Zeileis, 2020) which is a panel dataset re- 
porting annual state level observations on U.S. traffic fatalities for the period 
1982 through 1988. The applications analyze if there are effects of alcohol taxes 
and drunk driving laws on road fatalities and, if present, how strong these effects 
are. 

We introduce plm(), a convenient R function from the package plm (Croissant et
al., 2020) that enables us to estimate linear panel regression models. Usage of
plm() is very similar to that of lm(), which we have used throughout the
previous chapters for estimation of simple and multiple regression models.


The following packages and their dependencies are needed for reproduction of 
the code chunks presented throughout this chapter on your computer: 


• AER
• plm
• stargazer


Check whether the following code chunk runs without any errors. 


library(AER)
library(plm)
library(stargazer)


10.1 Panel Data 


Key Concept 10.1 
Notation for Panel Data 


In contrast to cross-section data where we have observations on $n$ subjects
(entities), panel data has observations on $n$ entities at $T \geq 2$ time
periods. This is denoted

$$(X_{it}, Y_{it}), \quad i = 1, \dots, n \quad \text{and} \quad t = 1, \dots, T$$

where the index $i$ refers to the entity while $t$ refers to the time period.





Sometimes panel data is also called longitudinal data as it adds a temporal 
dimension to cross-sectional data. Let us have a look at the dataset Fatalities 
by checking its structure and listing the first few observations. 


# load the package and the dataset
library(AER)
data(Fatalities)


# obtain the dimension and inspect the structure
is.data.frame(Fatalities)
#> [1] TRUE
dim(Fatalities)
#> [1] 336  34


str(Fatalities)
#> 'data.frame':    336 obs. of  34 variables:
#>  $ state       : Factor w/ 48 levels "al","az","ar",..: 1 1 1 1 1 1 1 2 2 2 ...
#>  $ year        : Factor w/ 7 levels "1982","1983",..: 1 2 3 4 5 6 7 1 2 3 ...
#>  $ spirits     : num  1.37 1.36 1.32 1.28 1.23 ...
#>  $ unemp       : num  14.4 13.7 11.1 8.9 9.8 ...
#>  $ income      : num  10544 10733 11109 11333 11662 ...
#>  $ emppop      : num  50.7 52.1 54.2 55.3 56.5 ...
#>  $ beertax     : num  1.54 1.79 1.71 1.65 1.61 ...
#>  $ baptist     : num  30.4 30.3 30.3 30.3 30.3 ...
#>  $ mormon      : num  0.328 0.343 0.359 0.376 0.393 ...
#>  $ drinkage    : num  19 19 19 19.7 21 ...
#>  $ dry         : num  25 23 24 23.6 23.5 ...
#>  $ youngdrivers: num  0.212 0.211 0.211 0.211 0.213 ...
#>  $ miles       : num  7234 7836 8263 8727 8953 ...
#>  $ breath      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 ...
#>  $ jail        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 ...
#>  $ service     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 ...
#>  $ fatal       : int  839 930 932 882 1081 1110 1023 724 675 869 ...
#>  $ nfatal      : int  146 154 165 146 172 181 139 131 ...
#>  $ sfatal      : int  99 98 94 98 119 114 89 76 60 81 ...
#>  $ fatal1517   : int  53 71 49 66 82 94 66 40 40 51 ...
#>  $ nfatal1517  : int  9 8 7 9 10 11 ...
#>  $ fatal1820   : int  99 108 103 100 120 127 105 81 83 118 ...
#>  $ nfatal1820  : int  34 26 25 23 23 31 ...
#>  $ fatal2124   : int  120 124 118 114 119 138 ...
#>  $ nfatal2124  : int  32 35 34 45 29 30 25 36 17 33 ...
#>  $ afatal      : num  309 342 305 277 361 ...
#>  $ pop         : num  3942002 3960008 3988992 4021008 4049994 ...
#>  $ pop1517     : num  209000 202000 197000 195000 204000 ...
#>  $ pop1820     : num  221553 219125 216724 214349 212000 ...
#>  $ pop2124     : num  290000 290000 288000 284000 263000 ...
#>  $ milestot    : num  28516 31032 32961 35091 36259 ...
#>  $ unempus     : num  9.7 9.6 7.5 7.2 7 ...
#>  $ emppopus    : num  57.8 57.9 59.5 60.1 60.7 ...
#>  $ gsp         : num  -0.0221 0.0466 0.0628 0.0275 0.0321 ...


# list the first few observations
head(Fatalities)
#>   state year spirits unemp   income   emppop  beertax baptist  mormon drinkage
#> 1    al 1982    1.37  14.4 10544.15 50.69204 1.539379 30.3557 0.32829    19.00
#> 2    al 1983    1.36  13.7 10732.80 52.14703 1.788991 30.3336 0.34341    19.00
#> 3    al 1984    1.32  11.1 11108.79 54.16809 1.714286 30.3115 0.35924    19.00
#> 4    al 1985    1.28   8.9 11332.63 55.27114 1.652542 30.2895 0.37579    19.67
#> 5    al 1986    1.23   9.8 11661.51 56.51450 1.609907 30.2674 0.39311    21.00
#> 6    al 1987    1.18   7.8 11944.00 57.50988 1.560000 30.2453 0.41123    21.00
#>       dry youngdrivers    miles breath jail service fatal nfatal sfatal
#> 1 25.0063     0.211572 7233.887     no   no      no   839    146     99
#> 2 22.9942     0.210768 7836.348     no   no      no   930    154     98
#> 3 24.0426     0.211484 8262.990     no   no      no   932    165     94
#> 4 23.6339     0.211140 8726.917     no   no      no   882    146     98
#> 5 23.4647     0.213400 8952.854     no   no      no  1081    172    119
#> 6 23.7924     0.215527 9166.302     no   no      no  1110    181    114
#>   fatal1517 nfatal1517 fatal1820 nfatal1820 fatal2124 nfatal2124  afatal
#> 1        53          9        99         34       120         32 309.438
#> 2        71          8       108         26       124         35 341.834
#> 3        49          7       103         25       118         34 304.872
#> 4        66          9       100         23       114         45 276.742
#> 5        82         10       120         23       119         29 360.716
#> 6        94         11       127         31       138         30 368.421
#>       pop  pop1517  pop1820  pop2124 milestot unempus emppopus         gsp
#> 1 3942002 208999.6 221553.4 290000.1    28516     9.7     57.8 -0.02212476
#> 2 3960008 202000.1 219125.5 290000.2    31032     9.6     57.9  0.04655825
#> 3 3988992 197000.0 216724.1 288000.2    32961     7.5     59.5  0.06279784
#> 4 4021008 194999.7 214349.0 284000.3    35091     7.2     60.1  0.02748997
#> 5 4049994 203999.9 212000.0 263000.3    36259     7.0     60.7  0.03214295
#> 6 4082999 204999.8 208998.5 258999.8    37426     6.2     61.5  0.04897637


# summarize the variables 'state' and 'year'
summary(Fatalities[, c(1, 2)])
#>      state       year
#>  al     :  7   1982:48
#>  az     :  7   1983:48
#>  ar     :  7   1984:48
#>  ca     :  7   1985:48
#>  co     :  7   1986:48
#>  ct     :  7   1987:48
#>  (Other):294   1988:48


We find that the dataset consists of 336 observations on 34 variables. Notice 
that the variable state is a factor variable with 48 levels (one for each of the 
48 contiguous federal states of the U.S.). The variable year is also a factor 
variable that has 7 levels identifying the time period when the observation was 
made. This gives us 7 x 48 = 336 observations in total. Since all variables are 
observed for all entities and over all time periods, the panel is balanced. If there 
were missing data for at least one entity in at least one time period we would 
call the panel unbalanced. 
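
The counting argument above can be turned into a small check. The sketch below uses a toy panel rather than Fatalities; the helper is_balanced() and the objects panel_toy and panel_unbal are hypothetical names introduced only for this illustration. The helper merely verifies that every entity-period combination appears exactly once; it does not look for missing values within individual variables.

```r
# toy panel: 3 entities observed in 4 years (hypothetical data)
panel_toy <- expand.grid(year  = factor(1982:1985),
                         state = factor(c("a", "b", "c")))
panel_toy$y <- seq_len(nrow(panel_toy))

# a panel is balanced if each entity-time combination occurs exactly once
is_balanced <- function(data, entity, time) {
  all(table(data[[entity]], data[[time]]) == 1)
}

is_balanced(panel_toy, "state", "year")

# dropping a single observation makes the panel unbalanced
panel_unbal <- panel_toy[-1, ]
is_balanced(panel_unbal, "state", "year")
```

Applied to Fatalities, the same idea amounts to checking that table(Fatalities$state, Fatalities$year) contains only ones.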


Example: Traffic Deaths and Alcohol Taxes 


We start by reproducing Figure 10.1 of the book. To this end we estimate sim- 
ple regressions using data for years 1982 and 1988 that model the relationship 
between beer tax (adjusted for 1988 dollars) and the traffic fatality rate, mea- 
sured as the number of fatalities per 10000 inhabitants. Afterwards, we plot the 
data and add the corresponding estimated regression functions. 
# define the fatality rate
Fatalities$fatal_rate <- Fatalities$fatal / Fatalities$pop * 10000


# subset the data
Fatalities1982 <- subset(Fatalities, year == "1982")
Fatalities1988 <- subset(Fatalities, year == "1988")


# estimate simple regression models using 1982 and 1988 data 
fatal1982_mod <- lm(fatal_rate ~ beertax, data = Fatalities1982) 
fatal1988_mod <- lm(fatal_rate ~ beertax, data = Fatalities1988) 


coeftest(fatal1982_mod, vcov. = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  2.01038    0.14957 13.4408   <2e-16 ***
#> beertax      0.14846    0.13261  1.1196   0.2687
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fatal1988_mod, vcov. = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>             Estimate Std. Error t value  Pr(>|t|)
#> (Intercept)  1.85907    0.11461 16.2205 < 2.2e-16 ***
#> beertax      0.43875    0.12786  3.4314  0.001279 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated regression functions are 


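$$\widehat{FatalityRate} = \underset{(0.15)}{2.01} + \underset{(0.13)}{0.15} \times BeerTax \quad (1982 \text{ data}),$$

$$\widehat{FatalityRate} = \underset{(0.11)}{1.86} + \underset{(0.13)}{0.44} \times BeerTax \quad (1988 \text{ data}).$$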


# plot the observations and add the estimated regression line for 1982 data
plot(x = Fatalities1982$beertax,
     y = Fatalities1982$fatal_rate,
     xlab = "Beer tax (in 1988 dollars)",
     ylab = "Fatality rate (fatalities per 10000)",
     main = "Traffic Fatality Rates and Beer Taxes in 1982",
     ylim = c(0, 4.5),
     pch = 20,
     col = "steelblue")


abline(fatal1982_mod, lwd = 1.5)


[Figure: scatterplot "Traffic Fatality Rates and Beer Taxes in 1982" with the
estimated regression line; x-axis: Beer tax (in 1988 dollars), y-axis: Fatality
rate (fatalities per 10000)]


# plot observations and add estimated regression line for 1988 data
plot(x = Fatalities1988$beertax,
     y = Fatalities1988$fatal_rate,
     xlab = "Beer tax (in 1988 dollars)",
     ylab = "Fatality rate (fatalities per 10000)",
     main = "Traffic Fatality Rates and Beer Taxes in 1988",
     ylim = c(0, 4.5),
     pch = 20,
     col = "steelblue")


abline(fatal1988_mod, lwd = 1.5)


[Figure: scatterplot "Traffic Fatality Rates and Beer Taxes in 1988" with the
estimated regression line; x-axis: Beer tax (in 1988 dollars), y-axis: Fatality
rate (fatalities per 10000)]


In both plots, each point represents observations of beer tax and fatality rate
for a given state in the respective year. The regression results indicate a
positive relationship between the beer tax and the fatality rate for both years.
The estimated coefficient on beer tax for the 1988 data is almost three times as
large as for the 1982 data. This is contrary to our expectations: alcohol
taxes are supposed to lower the rate of traffic fatalities. As we know from
Chapter 6, this is possibly due to omitted variable bias, since neither model
includes any covariates, e.g., economic conditions. This could be corrected
for using a multiple regression approach. However, this cannot account for
omitted unobservable factors that differ from state to state but can be assumed
to be constant over the observation span, e.g., the populations' attitude
towards drunk driving. As shown in the next section, panel data allow us to hold
such factors constant.


10.2 Panel Data with Two Time Periods: “Before and After” Comparisons


Suppose there are only $T = 2$ time periods $t = 1982, 1988$. This allows us to
analyze differences in changes of the fatality rate from year 1982 to 1988.
We start by considering the population regression model


$$FatalityRate_{it} = \beta_0 + \beta_1 BeerTax_{it} + \beta_2 Z_i + u_{it}$$


where the $Z_i$ are state specific characteristics that differ between states
but are constant over time. For $t = 1982$ and $t = 1988$ we have


$$FatalityRate_{i1982} = \beta_0 + \beta_1 BeerTax_{i1982} + \beta_2 Z_i + u_{i1982},$$

$$FatalityRate_{i1988} = \beta_0 + \beta_1 BeerTax_{i1988} + \beta_2 Z_i + u_{i1988}.$$


We can eliminate the $Z_i$ by regressing the difference in the fatality rate
between 1988 and 1982 on the difference in beer tax between those years:


$$FatalityRate_{i1988} - FatalityRate_{i1982} = \beta_1 (BeerTax_{i1988} - BeerTax_{i1982}) + u_{i1988} - u_{i1982}$$


This regression model yields an estimate for $\beta_1$ that is robust to a
possible bias due to omission of the $Z_i$, since these influences are
eliminated from the model. Next we use R to estimate a regression based on the
differenced data and plot the estimated regression function.


# compute the differences 
diff_fatal_rate <- Fatalities1988$fatal_rate - Fatalities1982$fatal_rate 
diff_beertax <- Fatalities1988$beertax - Fatalities1982$beertax 


# estimate a regression using differenced data 
fatal_diff_mod <- lm(diff_fatal_rate ~ diff_beertax) 


coeftest(fatal_diff_mod, vcov. = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -0.072037   0.065355 -1.1022 0.276091
#> diff_beertax -1.040973   0.355006 -2.9323 0.005229 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Including the intercept allows for a change in the mean fatality rate in the time 
between 1982 and 1988 in the absence of a change in the beer tax. 


We obtain the OLS estimated regression function 


$$\widehat{FatalityRate_{i1988} - FatalityRate_{i1982}} = -\underset{(0.065)}{0.072} - \underset{(0.36)}{1.04} \times (BeerTax_{i1988} - BeerTax_{i1982}).$$


# plot the differenced data
plot(x = diff_beertax,
     y = diff_fatal_rate,
     xlab = "Change in beer tax (in 1988 dollars)",
     ylab = "Change in fatality rate (fatalities per 10000)",
     main = "Changes in Traffic Fatality Rates and Beer Taxes in 1982-1988",
     xlim = c(-0.6, 0.6),
     ylim = c(-1.5, 1),
     pch = 20,
     col = "steelblue")


# add the regression line to plot 
abline(fatal_diff_mod, lwd = 1.5) 
[Figure: scatterplot "Changes in Traffic Fatality Rates and Beer Taxes in
1982-1988" with the estimated regression line; x-axis: Change in beer tax (in
1988 dollars), y-axis: Change in fatality rate (fatalities per 10000)]


The estimated coefficient on beer tax is now negative and significantly different 
from zero at 5%. Its interpretation is that raising the beer tax by $1 causes 
traffic fatalities to decrease by 1.04 per 10000 people. This is rather large as the 
average fatality rate is approximately 2 persons per 10000 people. 


# compute mean fatality rate over all states for all time periods 
mean(Fatalities$fatal_rate)


#> [1] 2.040444 


Once more this outcome is likely to be a consequence of omitting factors in the
single-year regression that influence the fatality rate and are correlated with
the beer tax and change over time. The message is that we need to be more
careful and control for such factors before drawing conclusions about the effect
of an increase in beer taxes.


The approach presented in this section discards information for years 1983 to
1987. A method that allows us to use data for more than $T = 2$ time periods and
enables us to add control variables is the fixed effects regression approach.
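
To see why differencing removes time-invariant omitted factors, a small simulation can help. This is a sketch under assumed parameter values, not a replication of the traffic fatality results; all object names (z, x1, x2, y1, y2, b_cross, b_diff) and the chosen coefficients are hypothetical.

```r
# simulated two-period panel (hypothetical data)
set.seed(42)
n     <- 500
beta1 <- -1                        # true effect of x on y
z     <- rnorm(n)                  # time-invariant factor, unobserved in practice
x1    <- 0.8 * z + rnorm(n)        # regressor in period 1, correlated with z
x2    <- 0.8 * z + rnorm(n)        # regressor in period 2, correlated with z
y1    <- 2 + beta1 * x1 + 3 * z + rnorm(n)
y2    <- 2 + beta1 * x2 + 3 * z + rnorm(n)

# single-period regression that omits z: suffers from omitted variable bias
b_cross <- unname(coef(lm(y2 ~ x2))[2])

# "before and after" regression: z drops out of the differenced equation
dy <- y2 - y1
dx <- x2 - x1
b_diff <- unname(coef(lm(dy ~ dx))[2])

c(cross_section = b_cross, differenced = b_diff)
```

In this setup the cross-sectional slope even has the wrong sign, while the slope from the differenced regression is close to the true value of -1, mirroring the sign change observed for the beer tax coefficient.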


10.3 Fixed Effects Regression 


Consider the panel regression model


$$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 Z_i + u_{it}$$


where the $Z_i$ are unobserved time-invariant heterogeneities across the
entities $i = 1, \dots, n$. We aim to estimate $\beta_1$, the effect on $Y_i$ of
a change in $X_i$ holding constant $Z_i$. Letting
$\alpha_i = \beta_0 + \beta_2 Z_i$ we obtain the model


$$Y_{it} = \alpha_i + \beta_1 X_{it} + u_{it}. \tag{10.1}$$


Having individual specific intercepts $\alpha_i$, $i = 1, \dots, n$, where each
of these can be understood as the fixed effect of entity $i$, this model is
called the fixed effects model. The variation in the $\alpha_i$,
$i = 1, \dots, n$ comes from the $Z_i$. (10.1) can be rewritten as a regression
model containing $n - 1$ dummy regressors and a constant:


$$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \gamma_3 D3_i + \cdots + \gamma_n Dn_i + u_{it}. \tag{10.2}$$


Model (10.2) has $n$ different intercepts, one for every entity. (10.1) and
(10.2) are equivalent representations of the fixed effects model.








The fixed effects model can be generalized to contain more than just one de- 
terminant of Y that is correlated with X and changes over time. Key Concept 
10.2 presents the generalized fixed effects regression model. 


Key Concept 10.2 
The Fixed Effects Regression Model 


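The fixed effects regression model is


$$Y_{it} = \beta_1 X_{1,it} + \cdots + \beta_k X_{k,it} + \alpha_i + u_{it} \tag{10.3}$$


with $i = 1, \dots, n$ and $t = 1, \dots, T$. The $\alpha_i$ are entity-specific
intercepts that capture heterogeneities across entities. An equivalent
representation of this model is given by


$$Y_{it} = \beta_0 + \beta_1 X_{1,it} + \cdots + \beta_k X_{k,it} + \gamma_2 D2_i + \gamma_3 D3_i + \cdots + \gamma_n Dn_i + u_{it} \tag{10.4}$$


where the $D2_i, D3_i, \dots, Dn_i$ are dummy variables.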





Estimation and Inference 


Software packages use a so-called “entity-demeaned” OLS algorithm which is 
computationally more efficient than estimating regression models with k + n 
regressors as needed for models (10.3) and (10.4). 


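Taking averages on both sides of (10.1) we obtain


$$\frac{1}{T} \sum_{t=1}^{T} Y_{it} = \beta_1 \frac{1}{T} \sum_{t=1}^{T} X_{it} + \alpha_i + \frac{1}{T} \sum_{t=1}^{T} u_{it}$$

$$\overline{Y}_i = \beta_1 \overline{X}_i + \alpha_i + \overline{u}_i.$$


Subtraction from (10.1) yields


$$Y_{it} - \overline{Y}_i = \beta_1 (X_{it} - \overline{X}_i) + (u_{it} - \overline{u}_i)$$

$$\widetilde{Y}_{it} = \beta_1 \widetilde{X}_{it} + \widetilde{u}_{it}. \tag{10.5}$$


In this model, the OLS estimate of the parameter of interest $\beta_1$ is equal
to the estimate obtained using (10.2), without the need to estimate $n - 1$
dummies and an intercept.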


We conclude that there are two ways of estimating $\beta_1$ in the fixed effects
regression:


1. OLS of the dummy regression model as shown in (10.2) 


2. OLS using the entity demeaned data as in (10.5) 
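
The equivalence of the two approaches can be verified on simulated data. The following is a minimal sketch; the panel dimensions, the coefficient values and all object names (id, alpha, b_dummy, b_demeaned) are assumptions made for this illustration.

```r
# simulated panel with entity-specific intercepts (hypothetical data)
set.seed(10)
n_ent  <- 20
n_time <- 5
id     <- factor(rep(seq_len(n_ent), each = n_time))
alpha  <- rnorm(n_ent, sd = 2)[as.integer(id)]   # fixed effect of each entity
x      <- 0.5 * alpha + rnorm(n_ent * n_time)    # regressor correlated with alpha
y      <- alpha - 0.7 * x + rnorm(n_ent * n_time)

# (1) OLS of the dummy regression model: a constant and n - 1 entity dummies
b_dummy <- unname(coef(lm(y ~ x + id))["x"])

# (2) OLS using entity-demeaned data, without an intercept
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
b_demeaned <- unname(coef(lm(y_dm ~ x_dm - 1)))

all.equal(b_dummy, b_demeaned)
```

Both slope estimates coincide up to numerical precision. Note that the standard errors reported by lm() for the demeaned regression do not account for the degrees of freedom used up by the entity means, so only the point estimate should be taken from approach (2) when it is computed by hand like this.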


Provided the fixed effects regression assumptions stated in Key Concept 10.3 
hold, the sampling distribution of the OLS estimator in the fixed effects regres- 
sion model is normal in large samples. The variance of the estimates can be esti- 
mated and we can compute standard errors, t-statistics and confidence intervals 
for coefficients. In the next section, we see how to estimate a fixed effects model 
using R and how to obtain a model summary that reports heteroskedasticity- 
robust standard errors. We leave aside complicated formulas of the estimators. 
See Chapter 10.5 and Appendix 10.2 of the book for a discussion of theoretical 
aspects. 


Application to Traffic Deaths 


Following Key Concept 10.2, the simple fixed effects model for estimation of the
relation between traffic fatality rates and the beer taxes is


$$FatalityRate_{it} = \beta_1 BeerTax_{it} + StateFixedEffects + u_{it}, \tag{10.6}$$


a regression of the traffic fatality rate on beer tax and 48 binary regressors,
one for each federal state.


We can simply use the function lm() to obtain an estimate of $\beta_1$.


fatal_fe_lm_mod <- lm(fatal_rate ~ beertax + state - 1, data = Fatalities)
fatal_fe_lm_mod

#>
#> Call:
#> lm(formula = fatal_rate ~ beertax + state - 1, data = Fatalities)
#>
#> Coefficients:
#> beertax  stateal  stateaz  statear  stateca  stateco  statect  statede
#> -0.6559   3.4776   2.9099   2.8227   1.9682   1.9933   1.6154   2.1700
#> statefl  statega  stateid  stateil  statein  stateia  stateks  stateky
#>  3.2095   4.0022   2.8086   1.5160   2.0161   1.9337   2.2544   2.2601
#> statela  stateme  statemd  statema  statemi  statemn  statems  statemo
#>  2.6305   2.3697   1.7712   1.3679   1.9931   1.5804   3.4486   2.1814
#> statemt  statene  statenv  statenh  statenj  statenm  stateny  statenc
#>  3.1172   1.9555   2.8769   2.2232   1.3719   3.9040   1.2910   3.1872
#> statend  stateoh  stateok  stateor  statepa  stateri  statesc  statesd
#>  1.8542   1.8032   2.9326   2.3096   1.7102   1.2126   4.0348   2.4739
#> statetn  statetx  stateut  statevt  stateva  statewa  statewv  statewi
#>  2.6020   2.5602   2.3137   2.5116   2.1874   1.8181   2.5809   1.7184
#> statewy
#>  3.2491


As discussed in the previous section, it is also possible to estimate $\beta_1$
by applying OLS to the demeaned data, that is, to run the regression


$$\widetilde{FatalityRate}_{it} = \beta_1 \widetilde{BeerTax}_{it} + u_{it}.$$


# obtain demeaned data
Fatalities_demeaned <- with(Fatalities,
            data.frame(fatal_rate = fatal_rate - ave(fatal_rate, state),
            beertax = beertax - ave(beertax, state)))


# estimate the regression
summary(lm(fatal_rate ~ beertax - 1, data = Fatalities_demeaned))


The function ave is convenient for computing group averages. We use it to 
obtain state specific averages of the fatality rate and the beer tax. 


Alternatively one may use plm() from the package with the same name. 


# install and load the 'plm' package
# install.packages("plm")
library(plm)


As for lm() we have to specify the regression formula and the data to be used
in our call of plm(). Additionally, it is required to pass a vector of names of
entity and time ID variables to the argument index. For Fatalities, the ID
variable for entities is named state and the time ID variable is year. Since
the fixed effects estimator is also called the within estimator, we set model =
"within". Finally, the function coeftest() allows us to obtain inference based
on robust standard errors.


# estimate the fixed effects regression with plm()
fatal_fe_mod <- plm(fatal_rate ~ beertax,
                    data = Fatalities,
                    index = c("state", "year"),
                    model = "within")


# print summary using robust standard errors
coeftest(fatal_fe_mod, vcov. = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>         Estimate Std. Error t value Pr(>|t|)
#> beertax -0.65587    0.28880  -2.271  0.02388 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated coefficient is again $-0.6559$. Note that plm() uses the
entity-demeaned OLS algorithm and thus does not report dummy coefficients. The
estimated regression function is


$$\widehat{FatalityRate} = -0.66 \times BeerTax + StateFixedEffects. \tag{10.7}$$


The coefficient on BeerTax is negative and significant. The interpretation is
that the estimated reduction in traffic fatalities due to an increase in the
real beer tax by $1 is 0.66 per 10000 people, which is still pretty high.
Although including state fixed effects eliminates the risk of a bias due to
omitted factors that vary across states but not over time, we suspect that there
are other omitted variables that vary over time and thus cause a bias.


10.4 Regression with Time Fixed Effects 


Controlling for variables that are constant across entities but vary over time
can be done by including time fixed effects. If there are only time fixed
effects, the fixed effects regression model becomes


$$Y_{it} = \beta_0 + \beta_1 X_{it} + \delta_2 B2_t + \cdots + \delta_T BT_t + u_{it},$$


where only $T - 1$ dummies are included ($B1$ is omitted) since the model
includes an intercept. This model eliminates omitted variable bias caused by
excluding unobserved variables that evolve over time but are constant across
entities.


In some applications it is meaningful to include both entity and time fixed
effects. The entity and time fixed effects model is


$$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \cdots + \gamma_n Dn_i + \delta_2 B2_t + \cdots + \delta_T BT_t + u_{it}.$$


The combined model allows us to eliminate bias from unobservables that change
over time but are constant over entities and it controls for factors that differ
across entities but are constant over time. Such models can be estimated using
the OLS algorithm that is implemented in R.


The following code chunk shows how to estimate the combined entity and time
fixed effects model of the relation between fatalities and beer tax,


$$FatalityRate_{it} = \beta_1 BeerTax_{it} + StateEffects + TimeFixedEffects + u_{it}$$


using both lm() and plm(). It is straightforward to estimate this regression
with lm() since it is just an extension of (10.6) so we only have to adjust the
formula argument by adding the additional regressor year for time fixed effects.
In our call of plm() we set another argument effect = "twoways" for inclusion
of entity and time dummies.


# estimate a combined time and entity fixed effects regression model


# via lm()
fatal_tefe_lm_mod <- lm(fatal_rate ~ beertax + state + year - 1, data = Fatalities)
fatal_tefe_lm_mod
#>
#> Call:
#> lm(formula = fatal_rate ~ beertax + state + year - 1, data = Fatalities)
#>
#> Coefficients:
#>  beertax   stateal   stateaz   statear   stateca   stateco   statect   statede
#> -0.63998   3.51137   2.96451   2.87284   2.02618   2.04984   1.67125   2.22711
#>  statefl   statega   stateid   stateil   statein   stateia   stateks   stateky
#>  3.25132   4.02300   2.86242   1.57287   2.07123   1.98709   2.30707   2.31659
#>  statela   stateme   statemd   statema   statemi   statemn   statems   statemo
#>  2.67772   2.41713   1.82731   1.42335   2.04488   1.63488   3.49146   2.23598
#>  statemt   statene   statenv   statenh   statenj   statenm   stateny   statenc
#>  3.17160   2.00846   2.93322   2.27245   1.43016   3.95748   1.34849   3.22630
#>  statend   stateoh   stateok   stateor   statepa   stateri   statesc   statesd
#>  1.90762   1.85664   2.97776   2.36597   1.76563   1.26964   4.06496   2.52317
#>  statetn   statetx   stateut   statevt   stateva   statewa   statewv   statewi
#>  2.65670   2.61282   2.36165   2.56100   2.23618   1.87424   2.63364   1.77545
#>  statewy  year1983  year1984  year1985  year1986  year1987  year1988
#>  3.30791  -0.07990  -0.07242  -0.12398  -0.03786  -0.05090  -0.05180


# via plm()
fatal_tefe_mod <- plm(fatal_rate ~ beertax,
                      data = Fatalities,
                      index = c("state", "year"),
                      model = "within",
                      effect = "twoways")


coeftest(fatal_tefe_mod, vcov. = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>         Estimate Std. Error t value Pr(>|t|)
#> beertax -0.63998    0.35015 -1.8277  0.06865 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Before discussing the outcomes we convince ourselves that state and year are
of the class factor.

# check the class of 'state' and 'year'
class(Fatalities$state)
#> [1] "factor"
class(Fatalities$year)
#> [1] "factor"


The lm() function converts factors into dummies automatically. Since we exclude
the intercept by adding -1 to the right-hand side of the regression formula,
lm() estimates coefficients for $n + (T - 1) = 48 + 6 = 54$ binary variables (6
year dummies and 48 state dummies). Again, plm() only reports the estimated
coefficient on BeerTax.


The estimated regression function is


$$\widehat{FatalityRate} = -\underset{(0.35)}{0.64} \times BeerTax + StateEffects + TimeFixedEffects. \tag{10.8}$$


The result $-0.64$ is close to the estimated coefficient of $-0.66$ for the
regression model including only entity fixed effects. Unsurprisingly, the
coefficient is less precisely estimated but significantly different from zero at
10%.


In view of (10.7) and (10.8) we conclude that the estimated relationship between 
traffic fatalities and the real beer tax is not affected by omitted variable bias 
due to factors that are constant over time. 


10.5 The Fixed Effects Regression Assumptions 
and Standard Errors for Fixed Effects Regression 


This section focuses on the entity fixed effects model and presents model 
assumptions that need to hold in order for OLS to produce unbiased estimates 
that are normally distributed in large samples. These assumptions are an 
extension of the assumptions made for the multiple regression model (see Key 
Concept 6.4) and are given in Key Concept 10.3. We also briefly discuss 
standard errors in fixed effects models which differ from standard errors in 
multiple regression as the regression error can exhibit serial correlation in 
panel models. 


Key Concept 10.3 
The Fixed Effects Regression Assumptions 


In the fixed effects regression model 

$$Y_{it} = \beta_1 X_{it} + \alpha_i + u_{it}, \quad i = 1, \dots, n, \quad t = 1, \dots, T,$$

we assume the following: 

1. The error term $u_{it}$ has conditional mean zero, that is, 
$E(u_{it} | X_{i1}, X_{i2}, \dots, X_{iT}) = 0$. 

2. $(X_{i1}, X_{i2}, \dots, X_{iT}, u_{i1}, \dots, u_{iT})$, $i = 1, \dots, n$ are i.i.d. draws from 
their joint distribution. 

3. Large outliers are unlikely, i.e., $(X_{it}, u_{it})$ have nonzero finite fourth 
moments. 

4. There is no perfect multicollinearity. 

When there are multiple regressors, $X_{it}$ is replaced by 
$X_{1,it}, X_{2,it}, \dots, X_{k,it}$. 





The first assumption is that the error is uncorrelated with all observations of 
the variable X for the entity i over time. If this assumption is violated, we 
face omitted variable bias. The second assumption ensures that variables are 
i.i.d. across entities $i = 1, \dots, n$. This does not require the observations 
to be uncorrelated within an entity: the $X_{it}$ are allowed to be autocorrelated 
within entities, which is a common property of time series data. The same is 
allowed for the errors $u_{it}$. Consult Chapter 10.5 of the book for a detailed 
explanation of why autocorrelation is plausible in panel applications. The 
second assumption is justified if the entities are selected by simple random 
sampling. The third and fourth assumptions are analogous to the multiple 
regression assumptions made in Key Concept 6.4. 


Standard Errors for Fixed Effects Regression 


As with heteroskedasticity, autocorrelation invalidates the usual standard 
error formulas. It also invalidates heteroskedasticity-robust standard errors, 
since these are derived under the assumption that there is no autocorrelation. 
When there is both heteroskedasticity and autocorrelation, so-called 
heteroskedasticity and autocorrelation-consistent (HAC) standard errors need 
to be used. Clustered standard errors belong to this type of standard errors. 
They allow for heteroskedasticity and autocorrelated errors within an entity 
but not for correlation across entities. 


As shown in the examples throughout this chapter, it is fairly easy to specify 
usage of clustered standard errors in regression summaries produced by functions 
like coeftest() in conjunction with vcovHC() from the package sandwich. 
Conveniently, vcovHC() recognizes panel model objects (objects of class plm) 
and computes clustered standard errors by default. 


The regressions conducted in this chapter are good examples of why usage 
of clustered standard errors is crucial in empirical applications of fixed effects 
models. For example, consider the entity and time fixed effects model for 
fatalities. Since fatal_tefe_lm_mod is an object of class lm, coeftest() does not 
compute clustered standard errors but uses robust standard errors that are only 
valid in the absence of autocorrelated errors. 


# check class of the model object
class(fatal_tefe_lm_mod)
#> [1] "lm"

# obtain a summary based on heteroskedasticity-robust standard errors 
# (adjustment for heteroskedasticity only, none for autocorrelation)
coeftest(fatal_tefe_lm_mod, vcov = vcovHC, type = "HC1")[1, ]
#>   Estimate Std. Error    t value   Pr(>|t|) 
#> -0.6399800  0.2547149 -2.5125346   0.012570 

# check class of the (plm) model object
class(fatal_tefe_mod)
#> [1] "plm"        "panelmodel"

# obtain a summary based on clustered standard errors 
# (adjustment for autocorrelation + heteroskedasticity)
coeftest(fatal_tefe_mod, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>         Estimate Std. Error t value Pr(>|t|)  
#> beertax -0.63998    0.35015 -1.8277  0.06865 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The outcomes differ rather strongly: imposing no autocorrelation we obtain 
a standard error of 0.25 which implies significance of $\hat\beta_1$, the 
coefficient on BeerTax, at the level of 5%. In contrast, with the clustered 
standard error of 0.35 we cannot reject the hypothesis $H_0: \beta_1 = 0$ at the 
same level, see equation (10.8). Consult Appendix 10.2 of the book for insights 
on the computation of clustered standard errors. 
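The effect of the two standard errors on the test decision can be retraced by hand. The sketch below uses only the estimate and standard errors reported in the outputs above. As a simplifying assumption it computes p-values from the standard normal distribution, so they differ slightly from the t-based values printed by coeftest(). All object names are hypothetical.

```r
# estimate and the two standard errors reported above
beta_hat     <- -0.63998
se_robust    <- 0.2547149   # heteroskedasticity-robust (lm object)
se_clustered <- 0.35015     # clustered (plm object)

# t-statistics for H0: beta_1 = 0
t_robust    <- beta_hat / se_robust
t_clustered <- beta_hat / se_clustered

# two-sided p-values based on the standard normal approximation
p_robust    <- 2 * pnorm(-abs(t_robust))
p_clustered <- 2 * pnorm(-abs(t_clustered))

round(c(t_robust = t_robust, t_clustered = t_clustered), 4)
round(c(p_robust = p_robust, p_clustered = p_clustered), 4)
```

Only the first p-value falls below 0.05, so the conclusion at the 5% level depends on which standard error is used.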


10.6 Drunk Driving Laws and Traffic Deaths 


There are two major sources of omitted variable bias that are not accounted for 
by any of the models of the relation between traffic fatalities and beer taxes 
that we have considered so far: economic conditions and driving laws. 
Fortunately, Fatalities has data on the state-specific legal drinking age 
(drinkage), punishment (jail, service) and various economic indicators like 
the unemployment rate (unemp) and per capita income (income). We may use these 
covariates to extend the preceding analysis. 


These covariates are defined as follows: 


• unemp: a numeric variable stating the state-specific unemployment rate. 
• log(income): the logarithm of real per capita income (in prices of 1988). 
• miles: the state average miles per driver. 
• drinkage: the state-specific minimum legal drinking age. 
• drinkagec: a discretized version of drinkage that classifies states into four 
  categories of minimum legal drinking age: 18, 19, 20, 21 and older. R denotes 
  these as [18,19), [19,20), [20,21) and [21,22]. These categories are 
  included as dummy regressors where [21,22] is chosen as the reference 
  category. 
• punish: a dummy variable with levels yes and no that measures if drunk 
  driving is severely punished by mandatory jail time or mandatory 
  community service (first conviction). 


First, we define the variables according to the regression results presented in 
Table 10.1 of the book. 


# discretize the minimum legal drinking age
Fatalities$drinkagec <- cut(Fatalities$drinkage,
                            breaks = 18:22, 
                            include.lowest = TRUE, 
                            right = FALSE)

# set minimum drinking age [21, 22] to be the baseline level
Fatalities$drinkagec <- relevel(Fatalities$drinkagec, "[21,22]")

# mandatory jail or community service?
Fatalities$punish <- with(Fatalities, factor(jail == "yes" | service == "yes", 
                                             labels = c("no", "yes")))

# the set of observations on all variables for 1982 and 1988
Fatalities_1982_1988 <- Fatalities[with(Fatalities, year == 1982 | year == 1988), ]


Next, we estimate all seven models using lm() and plm(). 


# estimate all seven models
fatalities_mod1 <- lm(fatal_rate ~ beertax, data = Fatalities)

fatalities_mod2 <- plm(fatal_rate ~ beertax + state, data = Fatalities)

fatalities_mod3 <- plm(fatal_rate ~ beertax + state + year,
                       index = c("state", "year"),
                       model = "within",
                       effect = "twoways", 
                       data = Fatalities)

fatalities_mod4 <- plm(fatal_rate ~ beertax + state + year + drinkagec 
                       + punish + miles + unemp + log(income), 
                       index = c("state", "year"),
                       model = "within",
                       effect = "twoways",
                       data = Fatalities)

fatalities_mod5 <- plm(fatal_rate ~ beertax + state + year + drinkagec 
                       + punish + miles,
                       index = c("state", "year"),
                       model = "within",
                       effect = "twoways",
                       data = Fatalities)

fatalities_mod6 <- plm(fatal_rate ~ beertax + year + drinkage 
                       + punish + miles + unemp + log(income), 
                       index = c("state", "year"),
                       model = "within",
                       effect = "twoways",
                       data = Fatalities)

fatalities_mod7 <- plm(fatal_rate ~ beertax + state + year + drinkagec 
                       + punish + miles + unemp + log(income), 
                       index = c("state", "year"),
                       model = "within",
                       effect = "twoways",
                       data = Fatalities_1982_1988)


We again use stargazer() (Hlavac, 2018) to generate a comprehensive tabular 
presentation of the results. 


library(stargazer)

# gather clustered standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(fatalities_mod1, type = "HC1"))),
               sqrt(diag(vcovHC(fatalities_mod2, type = "HC1"))),
               sqrt(diag(vcovHC(fatalities_mod3, type = "HC1"))),
               sqrt(diag(vcovHC(fatalities_mod4, type = "HC1"))),
               sqrt(diag(vcovHC(fatalities_mod5, type = "HC1"))),
               sqrt(diag(vcovHC(fatalities_mod6, type = "HC1"))),
               sqrt(diag(vcovHC(fatalities_mod7, type = "HC1"))))

# generate the table
stargazer(fatalities_mod1, fatalities_mod2, fatalities_mod3, 
          fatalities_mod4, fatalities_mod5, fatalities_mod6, fatalities_mod7, 
          digits = 3,
          header = FALSE,
          type = "latex", 
          se = rob_se,
          title = "Linear Panel Regression Models of Traffic Fatalities due to Drunk Driving",
          model.numbers = FALSE,
          column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)", "(7)"))


While columns (2) and (3) recap the results (10.7) and (10.8), column (1) 
presents an estimate of the coefficient of interest in the naive OLS regression 
of the fatality rate on beer tax without any fixed effects. We obtain a positive 
estimate for the coefficient on beer tax that is likely to be upward biased. The 
model fit is rather poor, too ($R^2 = 0.091$). The sign of the estimate changes as 
we extend the model by both entity and time fixed effects in models (2) and (3). 
Furthermore, $R^2$ increases substantially as fixed effects are included in the 
model equation. Nonetheless, as discussed before, the magnitudes of both 
estimates may be too large. 


The model specifications (4) to (7) include covariates that are meant to capture 
the effect of overall state economic conditions as well as the legal framework. 
Considering (4) as the baseline specification, we observe four interesting results: 

1. Including the covariates does not lead to a major reduction of the 
estimated effect of the beer tax. The coefficient is not significantly different 
from zero at the level of 5% as the estimate is rather imprecise. 


2. The minimum legal drinking age does not have an effect on traffic fatalities: 
none of the three dummy variables are significantly different from zero 
at any common level of significance. Moreover, an F-Test of the joint 
hypothesis that all three coefficients are zero does not reject. The next 
code chunk shows how to test this hypothesis. 


# test if legal drinking age has no explanatory power
linearHypothesis(fatalities_mod4,
                 test = "F",
                 c("drinkagec[18,19)=0", "drinkagec[19,20)=0", "drinkagec[20,21)=0"), 
                 vcov. = vcovHC, type = "HC1")
#> Linear hypothesis test
#> 
#> Hypothesis:
#> drinkagec[18,19) = 0
#> drinkagec[19,20) = 0
#> drinkagec[20,21) = 0
#> 
#> Model 1: restricted model
#> Model 2: fatal_rate ~ beertax + state + year + drinkagec + punish + miles + 
#>     unemp + log(income)
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F Pr(>F)
#> 1    276                 
#> 2    273  3 0.3782 0.7688


3. There is no evidence that punishment for first offenders has a deterrent 
effect on drunk driving: the corresponding coefficient is not significant at 
the 10% level. 


4. The economic variables significantly explain traffic fatalities. We can check 
that the unemployment rate and per capita income are jointly significant at 
the level of 0.1%. 


# test if economic indicators have no explanatory power
linearHypothesis(fatalities_mod4, 
                 test = "F",
                 c("log(income)", "unemp"), 
                 vcov. = vcovHC, type = "HC1")
#> Linear hypothesis test
#> 
#> Hypothesis:
#> log(income) = 0
#> unemp = 0
#> 
#> Model 1: restricted model
#> Model 2: fatal_rate ~ beertax + state + year + drinkagec + punish + miles + 
#>     unemp + log(income)
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F    Pr(>F)    
#> 1    275                        
#> 2    273  2 31.577 4.609e-13 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Model (5) omits the economic factors. The result supports the notion that 
economic indicators should remain in the model as the coefficient on beer tax 
is sensitive to the inclusion of the latter. 


Results for model (6) demonstrate that the legal drinking age has little explana- 
tory power and that the coefficient of interest is not sensitive to changes in the 
functional form of the relation between drinking age and traffic fatalities. 


Specification (7) reveals that reducing the amount of available information (we 
only use 95 observations for the years 1982 and 1988 here) inflates standard 
errors but does not lead to drastic changes in coefficient estimates. 


Summary 


We have not found evidence that severe punishments and increasing the min- 
imum drinking age reduce traffic fatalities due to drunk driving. Nonetheless, 
there seems to be a negative effect of alcohol taxes on traffic fatalities which, 
however, is estimated imprecisely and cannot be interpreted as the causal effect 
of interest as there still may be a bias. The issue is that there may be omitted 
variables that differ across states and change over time and this bias remains 
even though we use a panel approach that controls for entity specific and time 
invariant unobservables. 


A powerful method that can be used if common panel regression approaches fail 
is instrumental variables regression. We will return to this concept in Chapter 
12. 


10.7 Exercises 


This interactive part of the book is only available in the HTML version. 


Chapter 11 


Regression with a Binary 
Dependent Variable 


This chapter discusses a special class of regression models that aim to 
explain a limited dependent variable. In particular, we consider models where the 
dependent variable is binary. We will see that in such models, the regression 
function can be interpreted as a conditional probability function of the binary 
dependent variable. 


We review the following concepts: 


• the linear probability model 
• the Probit model 
• the Logit model 
• maximum likelihood estimation of nonlinear regression models 


Of course, we will also see how to estimate above models using R and discuss an 
application where we examine the question whether there is racial discrimination 
in the U.S. mortgage market. 


The following packages and their dependencies are needed for reproduction of 
the code chunks presented throughout this chapter on your computer: 


• AER (Kleiber and Zeileis, 2020) 
• stargazer (Hlavac, 2018) 


Check whether the following code chunk runs without any errors. 

library(AER)
library(stargazer)


11.1 Binary Dependent Variables and the 
Linear Probability Model 


Key Concept 11.1 
The Linear Probability Model 

The linear regression model 

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i$$

with a binary dependent variable $Y_i$ is called the linear probability model. 
In the linear probability model we have 

$$E(Y | X_1, X_2, \dots, X_k) = P(Y = 1 | X_1, X_2, \dots, X_k)$$

where 

$$P(Y = 1 | X_1, X_2, \dots, X_k) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki}.$$

Thus, $\beta_j$ can be interpreted as the change in the probability that $Y_i = 1$ 
associated with a one unit change in $X_j$, holding constant the other $k - 1$ 
regressors. Just as in common multiple regression, the $\beta_j$ can be estimated 
using OLS and the robust standard error formulas can be used for hypothesis 
testing and computation of confidence intervals. 


In most linear probability models, $R^2$ has no meaningful interpretation 
since the regression line can never fit the data perfectly if the dependent 
variable is binary and the regressors are continuous. This can be seen 
in the application below. 


It is essential to use robust standard errors since the $u_i$ in a linear 
probability model are always heteroskedastic. 


Linear probability models are easily estimated in R using the function 
lm(). 
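The heteroskedasticity of the errors follows from the binary nature of the dependent variable. A short derivation, where $p_i$ is shorthand introduced here (not the book's notation) for the conditional probability that $Y_i = 1$:

```latex
\begin{align*}
  \text{Var}(u_i \mid X_{1i}, \dots, X_{ki})
    &= \text{Var}(Y_i \mid X_{1i}, \dots, X_{ki}) \\
    &= p_i \, (1 - p_i),
    \qquad p_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}.
\end{align*}
```

Since $p_i$ varies with the regressors, so does the error variance, unless all slope coefficients are zero.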





Mortgage Data 


Following the book, we start by loading the data set HMDA which provides data 
that relate to mortgage applications filed in Boston in the year of 1990. 

# load `AER` package and attach the HMDA data
library(AER)
data(HMDA)


We continue by inspecting the first few observations and compute summary 
statistics afterwards. 


# inspect the data
head(HMDA)
#>   deny pirat hirat     lvrat chist mhist phist unemp selfemp insurance condomin
#> 1   no 0.221 0.221 0.8000000     5     2    no   3.9      no        no       no
#> 2   no 0.265 0.265 0.9218750     2     2    no   3.2      no        no       no
#> 3   no 0.372 0.248 0.9203980     1     2    no   3.2      no        no       no
#> 4   no 0.320 0.250 0.8604651     1     2    no   4.3      no        no       no
#> 5   no 0.360 0.350 0.6000000     1     1    no   3.2      no        no       no
#> 6   no 0.240 0.170 0.5105263     1     1    no   3.9      no        no       no
#>   afam single hschool
#> 1   no     no     yes
#> 2   no    yes     yes
#> 3   no     no     yes
#> 4   no     no     yes
#> 5   no     no     yes
#> 6   no     no     yes

summary(HMDA)
#>   deny          pirat            hirat            lvrat        chist   
#>  no :2095   Min.   :0.0000   Min.   :0.0000   Min.   :0.0200   1:1353  
#>  yes: 285   1st Qu.:0.2800   1st Qu.:0.2140   1st Qu.:0.6527   2: 441  
#>             Median :0.3300   Median :0.2600   Median :0.7795   3: 126  
#>             Mean   :0.3308   Mean   :0.2553   Mean   :0.7378   4:  77  
#>             3rd Qu.:0.3700   3rd Qu.:0.2988   3rd Qu.:0.8685   5: 182  
#>             Max.   :3.0000   Max.   :3.0000   Max.   :1.9500   6: 201  
#>  mhist    phist          unemp        selfemp    insurance  condomin  
#>  1: 747   no :2205   Min.   : 1.800   no :2103   no :2332   no :1694  
#>  2:1571   yes: 175   1st Qu.: 3.100   yes: 277   yes:  48   yes: 686  
#>  3:  41              Median : 3.200                                   
#>  4:  21              Mean   : 3.774                                   
#>                      3rd Qu.: 3.900                                   
#>                      Max.   :10.600                                   
#>   afam      single     hschool   
#>  no :2041   no :1444   no :  39  
#>  yes: 339   yes: 936   yes:2341  


The variable we are interested in modelling is deny, an indicator for whether 
an applicant's mortgage application has been accepted (deny = no) or denied 
(deny = yes). A regressor that ought to have power in explaining whether a 
mortgage application has been denied is pirat, the size of the anticipated total 
monthly loan payments relative to the applicant's income. It is 
straightforward to translate this into the simple regression model 

$$deny = \beta_0 + \beta_1 \times P/I\ ratio + u. \tag{11.1}$$

We estimate this model just as any other linear regression model using lm(). 
Before we do so, the variable deny must be converted to a numeric variable 
using as.numeric() as lm() does not accept the dependent variable to be of 
class factor. Note that as.numeric(HMDA$deny) will turn deny = no into 
deny = 1 and deny = yes into deny = 2, so using as.numeric(HMDA$deny) - 1 
we obtain the values 0 and 1. 
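The conversion can be tried out on a toy factor first. This is a minimal sketch with made-up values; deny_toy is a hypothetical object and not part of HMDA.

```r
# a small factor with the same levels as 'deny' (toy data, not from HMDA)
deny_toy <- factor(c("no", "yes", "no", "no", "yes"))

# levels are sorted alphabetically: "no" comes first, "yes" second
levels(deny_toy)

# as.numeric() returns the internal integer codes: 1 for "no", 2 for "yes"
as.numeric(deny_toy)

# subtracting 1 yields the binary coding: 0 for "no", 1 for "yes"
as.numeric(deny_toy) - 1
```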


# convert 'deny' to numeric
HMDA$deny <- as.numeric(HMDA$deny) - 1

# estimate a simple linear probability model
denymod1 <- lm(deny ~ pirat, data = HMDA)
denymod1
#> 
#> Call:
#> lm(formula = deny ~ pirat, data = HMDA)
#> 
#> Coefficients:
#> (Intercept)        pirat  
#>    -0.07991      0.60353


Next, we plot the data and the regression line to reproduce Figure 11.1 of the 
book. 


# plot the data
plot(x = HMDA$pirat, 
     y = HMDA$deny,
     main = "Scatterplot Mortgage Application Denial and the Payment-to-Income Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.8)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex = 0.8, "Mortgage approved")

# add the estimated regression line
abline(denymod1, 
       lwd = 1.8, 
       col = "steelblue")


[Figure: Scatterplot Mortgage Application Denial and the Payment-to-Income Ratio; x-axis: P/I ratio, y-axis: Deny]


According to the estimated model, a payment-to-income ratio of 1 is associated 
with an expected probability of mortgage application denial of roughly 50%. 
The model indicates that there is a positive relation between the payment-to- 
income ratio and the probability of a denied mortgage application so individuals 
with a high ratio of loan payments to income are more likely to be rejected. 
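These statements can be verified with a little arithmetic on the coefficient estimates printed above. The following sketch hard-codes the estimates; the names b0, b1, p_at_1 and pirat_cross are hypothetical.

```r
# coefficient estimates reported in the output above
b0 <- -0.07991
b1 <-  0.60353

# predicted denial probability for a P/I ratio of 1
p_at_1 <- b0 + b1 * 1
p_at_1

# change in the predicted probability for an increase
# of the P/I ratio by 0.01 (one percentage point)
b1 * 0.01

# P/I ratio at which the fitted line reaches a predicted probability of 1
pirat_cross <- (1 - b0) / b1
pirat_cross
```

Beyond the last value, the fitted line returns predictions larger than 1, which cannot be interpreted as probabilities.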


We may use coeftest() to obtain robust standard errors for both coefficient 
estimates. 


# print robust coefficient summary
coeftest(denymod1, vcov. = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) -0.079910   0.031967 -2.4998   0.01249 *  
#> pirat        0.603535   0.098483  6.1283 1.036e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated regression line is 

$$\widehat{deny} = -\underset{(0.032)}{0.080} + \underset{(0.098)}{0.604} \, P/I\ ratio. \tag{11.2}$$


The true coefficient on P/I ratio is statistically different from 0 at the 1% level. 
Its estimate can be interpreted as follows: a 1 percentage point increase in 
P/I ratio leads to an increase in the probability of a loan denial by 
$0.604 \cdot 0.01 = 0.00604 \approx 0.6$ percentage points. 


Following the book we augment the simple model (11.1) by an additional 
regressor black which equals 1 if the applicant is an African American and equals 0 
otherwise. Such a specification is the baseline for investigating if there is racial 
discrimination in the mortgage market: if being black has a significant (positive) 
influence on the probability of a loan denial when we control for factors that 
allow for an objective assessment of an applicant's creditworthiness, this is an 
indicator for discrimination. 


# rename the variable 'afam' for consistency
colnames(HMDA)[colnames(HMDA) == "afam"] <- "black"

# estimate the model
denymod2 <- lm(deny ~ pirat + black, data = HMDA)
coeftest(denymod2, vcov. = vcovHC)
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) -0.090514   0.033430 -2.7076  0.006826 ** 
#> pirat        0.559195   0.103671  5.3939 7.575e-08 ***
#> blackyes     0.177428   0.025055  7.0815 1.871e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated regression function is 

$$\widehat{deny} = -\underset{(0.029)}{0.091} + \underset{(0.089)}{0.559} \, P/I\ ratio + \underset{(0.025)}{0.177} \, black. \tag{11.3}$$


The coefficient on black is positive and significantly different from zero at the 
0.01% level. The interpretation is that, holding constant the P/I ratio, 
being black increases the probability of a mortgage application denial by about 
17.7 percentage points. This finding is compatible with racial discrimination. 
However, it might be distorted by omitted variable bias so discrimination could 
be a premature conclusion. 


11.2 Probit and Logit Regression 


The linear probability model has a major flaw: it assumes the conditional 
probability function to be linear. This does not restrict $P(Y = 1 | X_1, \dots, X_k)$ 
to lie between 0 and 1. We can easily see this in our reproduction of Figure 11.1 
of the book: for P/I ratio above about 1.79, (11.2) predicts the probability of a 
mortgage application denial to be bigger than 1. For applications with P/I ratio 
close to 0, the predicted probability of denial is even negative so that the model 
has no meaningful interpretation here. 


This circumstance calls for an approach that uses a nonlinear function to model 
the conditional probability function of a binary dependent variable. Commonly 
used methods are Probit and Logit regression. 


Probit Regression 


In Probit regression, the cumulative standard normal distribution function $\Phi(\cdot)$ 
is used to model the regression function when the dependent variable is binary, 
that is, we assume 

$$E(Y | X) = P(Y = 1 | X) = \Phi(\beta_0 + \beta_1 X). \tag{11.4}$$

$\beta_0 + \beta_1 X$ in (11.4) plays the role of a quantile $z$. Remember that 

$$\Phi(z) = P(Z \leq z), \quad Z \sim \mathcal{N}(0, 1)$$

such that the Probit coefficient $\beta_1$ in (11.4) is the change in $z$ associated 
with a one unit change in $X$. Although the effect on $z$ of a change in $X$ is 
linear, the link between $z$ and the dependent variable $Y$ is nonlinear since 
$\Phi$ is a nonlinear function of $X$. 


Since the dependent variable is a nonlinear function of the regressors, the coef- 
ficient on X has no simple interpretation. According to Key Concept 8.1, the 
expected change in the probability that Y = 1 due to a change in P/I ratio can 
be computed as follows: 


1. Compute the predicted probability that Y = 1 for the original value of X. 
2. Compute the predicted probability that Y = 1 for X + AX. 
3. Compute the difference between both predicted probabilities. 


Of course we can generalize (11.4) to Probit regression with multiple regressors 
to mitigate the risk of facing omitted variable bias. Probit regression essentials 
are summarized in Key Concept 11.2. 


Key Concept 11.2 
Probit Model, Predicted Probabilities and Estimated Effects 


Assume that $Y$ is a binary variable. The model 

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + u$$

with 

$$P(Y = 1 | X_1, X_2, \dots, X_k) = \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)$$

is the population Probit model with multiple regressors $X_1, X_2, \dots, X_k$ 
and $\Phi(\cdot)$ is the cumulative standard normal distribution function. 


The predicted probability that $Y = 1$ given $X_1, X_2, \dots, X_k$ can be 
calculated in two steps: 

1. Compute $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$. 

2. Look up $\Phi(z)$ by calling pnorm(). 


$\beta_j$ is the effect on $z$ of a one unit change in regressor $X_j$, holding 
constant all other $k - 1$ regressors. 


The effect on the predicted probability of a change in a regressor can be 
computed as in Key Concept 8.1. 


In R, Probit models can be estimated using the function glm() from the 
package stats. Using the argument family we specify that we want to 
use a Probit link function. 





We now estimate a simple Probit model of the probability of a mortgage denial. 


# estimate the simple probit model
denyprobit <- glm(deny ~ pirat, 
                  family = binomial(link = "probit"), 
                  data = HMDA)

coeftest(denyprobit, vcov. = vcovHC, type = "HC1")
#> 
#> z test of coefficients:
#> 
#>             Estimate Std. Error  z value  Pr(>|z|)    
#> (Intercept) -2.19415    0.18901 -11.6087 < 2.2e-16 ***
#> pirat        2.96787    0.53698   5.5269 3.259e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated model is 

$$\widehat{P(deny | P/I\ ratio)} = \Phi(-2.19 + 2.97 \, P/I\ ratio). \tag{11.5}$$


Just as in the linear probability model we find that the relation between the 
probability of denial and the payments-to-income ratio is positive and that the 
corresponding coefficient is highly significant. 


The following code chunk reproduces Figure 11.2 of the book. 


# plot data
plot(x = HMDA$pirat, 
     y = HMDA$deny,
     main = "Probit Model of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.85)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex = 0.8, "Mortgage approved")

# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(denyprobit, list(pirat = x), type = "response")

lines(x, y, lwd = 1.5, col = "steelblue")


[Figure: Probit Model of the Probability of Denial, Given P/I Ratio]


The estimated regression function has a stretched “S”-shape which is typical for 
the CDF of a continuous random variable with symmetric PDF like that of a 
normal random variable. The function is clearly nonlinear and flattens out for 
large and small values of P/I ratio. The functional form thus also ensures that 
the predicted conditional probabilities of a denial lie between 0 and 1. 


We use predict() to compute the predicted change in the denial probability 
when P/I ratio is increased from 0.3 to 0.4. 


# 1. compute predictions for P/I ratio = 0.3, 0.4 

predictions <- predict(denyprobit, 
newdata = data.frame("pirat" = c(0.3, 0.4)), 
type = "response") 


# 2. Compute difference in probabilities 
diff (predictions) 

#> 2 

#> 0.06081433 


We find that an increase in the payment-to-income ratio from 0.3 to 0.4 is 
predicted to increase the probability of denial by approximately 6.1 percentage 
points. 
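The same difference can be retraced by hand using the two steps from Key Concept 11.2. The sketch below hard-codes the coefficient estimates printed by coeftest() above, so the result should agree with diff(predictions) up to the rounding of those estimates. The names b0, b1, z and p_hat are hypothetical.

```r
# coefficient estimates of the simple Probit model reported above
b0 <- -2.19415
b1 <-  2.96787

# step 1: compute z for P/I ratio = 0.3 and 0.4
z <- b0 + b1 * c(0.3, 0.4)

# step 2: evaluate the standard normal CDF at z
p_hat <- pnorm(z)
p_hat

# difference in predicted probabilities
diff(p_hat)
```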


We continue by using an augmented Probit model to estimate the effect of race 
on the probability of a mortgage application denial. 


denyprobit2 <- glm(deny ~ pirat + black, 
                   family = binomial(link = "probit"), 
                   data = HMDA)

coeftest(denyprobit2, vcov. = vcovHC, type = "HC1")
#> 
#> z test of coefficients:
#> 
#>              Estimate Std. Error  z value  Pr(>|z|)    
#> (Intercept) -2.258787   0.176608 -12.7898 < 2.2e-16 ***
#> pirat        2.741779   0.497673   5.5092 3.605e-08 ***
#> blackyes     0.708155   0.083091   8.5227 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated model equation is 

$$\widehat{P(deny | P/I\ ratio, black)} = \Phi(-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} \, P/I\ ratio + \underset{(0.08)}{0.71} \, black). \tag{11.6}$$


While all coefficients are highly significant, both the estimated coefficients on 
the payments-to-income ratio and the indicator for African American descent 
are positive. Again, the coefficients are difficult to interpret but they indicate 
that, first, African Americans have a higher probability of denial than white ap- 
plicants, holding constant the payments-to-income ratio and second, applicants 
with a high payments-to-income ratio face a higher risk of being rejected. 


How big is the estimated difference in denial probabilities between two 
hypothetical applicants with the same payments-to-income ratio, one of whom is 
black and one of whom is white? As before, we may use predict() to compute this 
difference. 


# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denyprobit2, 
                       newdata = data.frame("black" = c("no", "yes"), 
                                            "pirat" = c(0.3, 0.3)),
                       type = "response")

# 2. compute difference in probabilities
diff(predictions)
#>         2 
#> 0.1578117


In this case, the estimated difference in denial probabilities is about 15.8 
percentage points. 


Logit Regression 


Key Concept 11.3 summarizes the Logit regression function. 


Key Concept 11.3 
Logit Regression 


The population Logit regression function is 

$$\begin{align*}
P(Y = 1 | X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\
=& \, \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}.
\end{align*}$$

The idea is similar to Probit regression except that a different CDF is 
used: 

$$F(x) = \frac{1}{1 + e^{-x}}$$

is the CDF of a standard logistically distributed random variable. 
As for Probit regression, there is no simple interpretation of the model coeffi- 
cients and it is best to consider predicted probabilities or differences in predicted 
probabilities. Here again, t-statistics and confidence intervals based on large 
sample normal approximations can be computed as usual. 


It is fairly easy to estimate a Logit regression model using R. 

denylogit <- glm(deny ~ pirat, 
                 family = binomial(link = "logit"), 
                 data = HMDA)

coeftest(denylogit, vcov. = vcovHC, type = "HC1")
#> 
#> z test of coefficients:
#> 
#>             Estimate Std. Error  z value  Pr(>|z|)    
#> (Intercept) -4.02843    0.35898 -11.2218 < 2.2e-16 ***
#> pirat        5.88450    1.00015   5.8836 4.014e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
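Predicted probabilities from the Logit model can be computed by hand from the logistic CDF given in Key Concept 11.3. The following sketch hard-codes the coefficient estimates printed above. The helper F_logis and the other object names are hypothetical; plogis() is the logistic CDF that ships with base R.

```r
# coefficient estimates of the simple Logit model reported above
b0 <- -4.02843
b1 <-  5.88450

# logistic CDF written out explicitly
F_logis <- function(x) 1 / (1 + exp(-x))

# predicted denial probabilities for P/I ratio = 0.3 and 0.4
z <- b0 + b1 * c(0.3, 0.4)
p_hat <- F_logis(z)
p_hat

# plogis() is the built-in logistic CDF and gives the same values
all.equal(p_hat, plogis(z))

# difference in predicted probabilities
diff(p_hat)
```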


The subsequent code chunk reproduces Figure 11.3 of the book. 


# plot data
plot(x = HMDA$pirat,
     y = HMDA$deny,
     main = "Probit and Logit Models of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.9)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex = 0.8, "Mortgage approved")

# add estimated regression lines of the Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")

lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)

# add a legend
legend("topleft",
       horiz = TRUE,
       legend = c("Probit", "Logit"),
       col = c("steelblue", "black"),
       lty = c(1, 2))


Both models produce very similar estimates of the probability that a mortgage
application will be denied depending on the applicant's payment-to-income ratio.


Following the book we extend the simple Logit model of mortgage denial with
the additional regressor black.


# estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ pirat + black,
                  family = binomial(link = "logit"),
                  data = HMDA)

coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
#>
#> z test of coefficients:
#>
#>             Estimate Std. Error  z value  Pr(>|z|)
#> (Intercept) -4.12556    0.34597 -11.9245 < 2.2e-16 ***
#> pirat        5.37036    0.96376   5.5723 2.514e-08 ***
#> blackyes     1.27278    0.14616   8.7081 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We obtain

$$\widehat{P(deny = 1 \vert P/I\ ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37}\, P/I\ ratio + \underset{(0.15)}{1.27}\, black). \tag{11.7}$$


As for the Probit model (11.6) all model coefficients are highly significant and 
we obtain positive estimates for the coefficients on P/I ratio and black. For 
comparison we compute the predicted probability of denial for two hypothetical 
applicants that differ in race and have a P/I ratio of 0.3. 


# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denylogit2,
                       newdata = data.frame("black" = c("no", "yes"),
                                            "pirat" = c(0.3, 0.3)),
                       type = "response")
predictions
#>          1          2
#> 0.07485143 0.22414592

# 2. compute difference in probabilities
diff(predictions)
#>         2
#> 0.1492945


We find that the white applicant faces a denial probability of only 7.5%, while
the African American is rejected with a probability of 22.4%, a difference of
14.9 percentage points.
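
The predicted probabilities can also be traced back to equation (11.7). The
following sketch plugs the coefficient estimates from the coeftest() output
above into the logistic CDF using base R's plogis(). The object names (b0,
b_pirat, b_black, p_hat) are introduced here for illustration only.

```r
# coefficient estimates copied from the coeftest() output above
b0      <- -4.12556
b_pirat <-  5.37036
b_black <-  1.27278

# linear index for a white (black = 0) and a black (black = 1) applicant
# who both have a P/I ratio of 0.3
index <- b0 + b_pirat * 0.3 + b_black * c(0, 1)

# apply the logistic CDF to obtain the predicted denial probabilities
p_hat <- plogis(index)
p_hat

# difference in predicted probabilities
diff(p_hat)
```

Up to rounding of the coefficients, this should reproduce the values returned
by predict() with type = "response".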


Comparison of the Models 


The Probit model and the Logit model deliver only approximations to the un- 
known population regression function E(Y|X). It is not obvious how to decide 
which model to use in practice. The linear probability model has the clear 
drawback of not being able to capture the nonlinear nature of the population 
regression function and it may predict probabilities to lie outside the interval 
[0,1]. Probit and Logit models are harder to interpret but capture the non- 
linearities better than the linear approach: both models produce predictions of 
probabilities that lie inside the interval [0, 1]. Predictions of all three models are 
often close to each other. The book suggests using the method that is easiest
to use in the statistical software of choice. As we have seen, it is equally easy
to estimate Probit and Logit models using R. We can therefore give no general
recommendation on which method to use.
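
The drawback of the linear probability model mentioned above can be illustrated
with a small simulation. The sketch below uses artificial data (the sample size,
the seed and the coefficients -1 and 1 are assumptions made for this example,
not estimates from the HMDA data) and compares the predictions of all three
models on a grid that includes extreme regressor values.

```r
set.seed(123)

# simulated data: binary outcome generated from a Probit model (assumption)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = pnorm(-1 + x))

# fit the three models
lpm    <- lm(y ~ x)
probit <- glm(y ~ x, family = binomial(link = "probit"))
logit  <- glm(y ~ x, family = binomial(link = "logit"))

# predicted probabilities on a grid of regressor values
grid <- data.frame(x = seq(-4, 4, by = 0.5))
pred <- cbind(grid,
              lpm    = predict(lpm, newdata = grid),
              probit = predict(probit, newdata = grid, type = "response"),
              logit  = predict(logit, newdata = grid, type = "response"))
round(pred, 3)

# range of the predictions of each model
sapply(pred[, c("lpm", "probit", "logit")], range)
```

In the center of the data the three sets of predictions tend to be similar,
while for extreme values of x only the linear probability model can leave the
interval [0, 1].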


11.3 Estimation and Inference in the Logit and 
Probit Models 


So far nothing has been said about how Logit and Probit models are estimated 
by statistical software. The reason why this is interesting is that both models 
are nonlinear in the parameters and thus cannot be estimated using OLS. In- 
stead one relies on maximum likelihood estimation (MLE). Another approach is 
estimation by nonlinear least squares (NLS). 


Nonlinear Least Squares 


Consider the multiple regression Probit model 


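$$E(Y_i \vert X_{1i}, \dots, X_{ki}) = P(Y_i = 1 \vert X_{1i}, \dots, X_{ki}) = \Phi(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}). \tag{11.8}$$


Similarly to OLS, NLS estimates the parameters $\beta_0, \beta_1, \dots, \beta_k$ by minimizing
the sum of squared mistakes

$$\sum_{i=1}^n \left[ Y_i - \Phi(b_0 + b_1 X_{1i} + \dots + b_k X_{ki}) \right]^2.$$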


NLS estimation is a consistent approach that produces estimates which are nor- 
mally distributed in large samples. In R there are functions like nls() from pack- 
age stats which provide algorithms for solving nonlinear least squares problems. 
However, NLS is inefficient, meaning that there are estimation techniques that 
have a smaller variance which is why we will not dwell any further on this topic. 
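
For readers who want to see NLS at work nonetheless, here is a minimal sketch
on simulated data. The data generating process (coefficients -0.5 and 1), the
starting values and all object names are assumptions made for this example.
The NLS estimates from nls() are compared with the maximum likelihood estimates
that glm() returns for the same Probit model.

```r
set.seed(42)

# simulated binary data from a Probit model (assumed coefficients -0.5 and 1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = pnorm(-0.5 + x))
dat <- data.frame(y = y, x = x)

# NLS: minimize the sum of squared deviations between y and Phi(b0 + b1 * x)
probit_nls <- nls(y ~ pnorm(b0 + b1 * x),
                  data = dat,
                  start = list(b0 = 0, b1 = 0.5))

# maximum likelihood estimates from glm() for comparison
probit_mle <- glm(y ~ x, family = binomial(link = "probit"), data = dat)

# both estimators are consistent, so the estimates should be close
rbind(NLS = coef(probit_nls), MLE = coef(probit_mle))
```

Note that the standard errors reported by summary() for the nls() fit rely on
homoskedastic errors, which does not hold for a binary dependent variable, so
only the point estimates are compared here.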


Maximum Likelihood Estimation


In MLE we seek to estimate the unknown parameters choosing them such that 
the likelihood of drawing the sample observed is maximized. This probability is 
measured by means of the likelihood function, the joint probability distribution 
of the data treated as a function of the unknown parameters. Put differently, 
the maximum likelihood estimates of the unknown parameters are the values 
that result in a model which is most likely to produce the data observed. It 
turns out that MLE is more efficient than NLS. 


As maximum likelihood estimates are normally distributed in large samples, 
statistical inference for coefficients in nonlinear models like Logit and Probit 
regression can be made using the same tools that are used for linear regression 
models: we can compute t-statistics and confidence intervals.


Many software packages use an MLE algorithm for estimation of nonlinear mod- 
els. The function glm() uses an algorithm named iteratively reweighted least 
squares. 
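
The principle can be sketched by maximizing a Probit likelihood directly. The
example below uses simulated data (coefficients, seed and object names are
assumptions for illustration) and the general-purpose optimizer optim() instead
of iteratively reweighted least squares. Both routes should arrive at
essentially the same estimates.

```r
set.seed(42)

# simulated binary data from a Probit model (assumed coefficients -0.5 and 1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = pnorm(-0.5 + x))

# negative Probit log-likelihood as a function of the coefficient vector b
neg_loglik <- function(b) {
  index <- b[1] + b[2] * x
  # log P(Y = 1) and log P(Y = 0), computed on the log scale for stability
  -sum(y * pnorm(index, log.p = TRUE) +
       (1 - y) * pnorm(index, lower.tail = FALSE, log.p = TRUE))
}

# numerical maximization of the likelihood (minimization of its negative)
mle_optim <- optim(par = c(0, 0), fn = neg_loglik, method = "BFGS")

# glm() maximizes the same likelihood by iteratively reweighted least squares
mle_glm <- glm(y ~ x, family = binomial(link = "probit"))

# compare the coefficient estimates
rbind(optim = mle_optim$par, glm = unname(coef(mle_glm)))

# the maximized log-likelihoods agree as well
c(optim = -mle_optim$value, glm = as.numeric(logLik(mle_glm)))
```

Small differences in the last digits are to be expected since the two
algorithms use different stopping rules.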


Measures of Fit 


It is important to be aware that the usual $R^2$ and $\bar{R}^2$ are invalid for nonlinear
regression models. The reason for this is simple: both measures assume that
the relation between the dependent and the explanatory variable(s) is linear.
This obviously does not hold for Probit and Logit models. Thus $R^2$ need not lie
between 0 and 1 and there is no meaningful interpretation. However, statistical
software sometimes reports these measures anyway.


There are many measures of fit for nonlinear regression models and there is no 
consensus which one should be reported. The situation is even more compli- 
cated because there is no measure of fit that is generally meaningful. For models 
with a binary response variable like deny one could use the following rule: 

If $Y_i = 1$ and $\widehat{P}(Y_i = 1 \vert X_{i1}, \dots, X_{ik}) > 0.5$ or if $Y_i = 0$ and
$\widehat{P}(Y_i = 1 \vert X_{i1}, \dots, X_{ik}) < 0.5$, consider the $Y_i$ as correctly predicted.
Otherwise $Y_i$ is said to be incorrectly predicted. The measure of fit is the
share of correctly predicted observations. The downside of such an approach is
that it does not mirror the quality of the prediction: whether
$\widehat{P}(Y_i = 1 \vert X_{i1}, \dots, X_{ik}) = 0.51$ or $\widehat{P}(Y_i = 1 \vert X_{i1}, \dots, X_{ik}) = 0.99$
is not reflected, we just predict $Y_i = 1$.¹
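
The rule can be sketched in a few lines of R. The example uses simulated data
and a simple Probit model (data generating process and object names are
assumptions for illustration). The same steps should carry over to any model
fitted with glm().

```r
set.seed(1)

# simulated binary data and a Probit fit (assumption for illustration)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = pnorm(-0.5 + x))
fit <- glm(y ~ x, family = binomial(link = "probit"))

# predicted probabilities P(Y = 1 | X)
p_hat <- fitted(fit)

# classify an observation as 1 if its predicted probability exceeds 0.5
y_pred <- as.numeric(p_hat > 0.5)

# share of correctly predicted observations
share_correct <- mean(y_pred == y)
share_correct

# cross table of observed and predicted outcomes
table(observed = y, predicted = y_pred)
```

The cross table shows which type of error dominates, something the single
share of correct predictions cannot reveal.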


An alternative to the latter are so-called pseudo-$R^2$ measures. In order to
measure the quality of the fit, these measures compare the value of the maximized
(log-)likelihood of the model with all regressors (the full model) to the likelihood
of a model with no regressors (null model, regression on a constant).





¹ This is in contrast to the case of a numeric dependent variable where we use the squared
errors for assessment of the quality of the prediction.


For example, consider a Probit regression. The pseudo-$R^2$ is given by

$$\text{pseudo-}R^2 = 1 - \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}$$

where $f^{max}_j \in [0,1]$ denotes the maximized likelihood for model $j$.


The reasoning behind this is that the maximized likelihood increases as
additional regressors are added to the model, similarly to the decrease in SSR
when regressors are added in a linear regression model. If the full model has a
similar maximized likelihood as the null model, the full model does not really
improve upon a model that uses only the information in the dependent
variable, so pseudo-$R^2 \approx 0$. If the full model fits the data very well, the maximized
likelihood should be close to 1 such that $\ln(f^{max}_{full}) \approx 0$ and pseudo-$R^2 \approx 1$. See
Appendix 11.2 of the book for more on MLE and pseudo-$R^2$ measures.


summary() does not report pseudo-$R^2$ for models estimated by glm() but we can
use the entries residual deviance (deviance) and null deviance (null.deviance)
instead. These are computed as

$$\text{deviance} = -2 \times \left[\ln(f^{max}_{full}) - \ln(f^{max}_{saturated})\right]$$

and

$$\text{null deviance} = -2 \times \left[\ln(f^{max}_{null}) - \ln(f^{max}_{saturated})\right]$$

where $f^{max}_{saturated}$ is the maximized likelihood for a model which assumes that each
observation has its own parameter (there are $n + 1$ parameters to be estimated
which leads to a perfect fit). For models with a binary dependent variable, the
saturated model reproduces every observation exactly so that
$\ln(f^{max}_{saturated}) = 0$, and it holds that

$$\text{pseudo-}R^2 = 1 - \frac{\text{deviance}}{\text{null deviance}} = 1 - \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}.$$


We now compute the pseudo-$R^2$ for the augmented Probit model of mortgage
denial.


# compute pseudo-R2 for the probit model of mortgage denial 
pseudoR2 <- 1 - (denyprobit2$deviance) / (denyprobit2$null.deviance) 
pseudoR2 

#> [1] 0.08594259 


Another way to obtain the pseudo-$R^2$ is to estimate the null model using glm()
and extract the maximized log-likelihoods for both the null and the full model
using the function logLik().


# compute the null model
denyprobit_null <- glm(formula = deny ~ 1,
                       family = binomial(link = "probit"),
                       data = HMDA)

# compute the pseudo-R2 using 'logLik'
1 - logLik(denyprobit2)[1] / logLik(denyprobit_null)[1]
#> [1] 0.08594259
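
That both routes give the same number is not specific to the HMDA data. The
following sketch repeats the two computations on simulated data (data
generating process and object names are assumptions for illustration) and also
checks that, for a binary dependent variable, the residual deviance equals -2
times the maximized log-likelihood.

```r
set.seed(1)

# simulated binary data (assumption for illustration)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = pnorm(-0.5 + x))

# full model and null model (regression on a constant)
full <- glm(y ~ x, family = binomial(link = "probit"))
null <- glm(y ~ 1, family = binomial(link = "probit"))

# pseudo-R2 based on the deviances stored in the full model
r2_dev <- 1 - full$deviance / full$null.deviance

# pseudo-R2 based on the maximized log-likelihoods
r2_ll <- 1 - as.numeric(logLik(full)) / as.numeric(logLik(null))

c(deviance_based = r2_dev, loglik_based = r2_ll)

# for binary data the deviance is -2 times the maximized log-likelihood
c(deviance = full$deviance, minus2loglik = -2 * as.numeric(logLik(full)))
```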


11.4 Application to the Boston HMDA Data 


Models (11.6) and (11.7) indicate that denial rates are higher for African Amer- 
ican applicants holding constant the payment-to-income ratio. Both results 
could be subject to omitted variable bias. In order to obtain a more trust- 
worthy estimate of the effect of being black on the probability of a mortgage 
application denial we estimate a linear probability model as well as several Logit 
and Probit models. We thereby control for financial variables and additional 
applicant characteristics which are likely to influence the probability of denial 
and differ between black and white applicants. 


Sample averages as shown in Table 11.1 of the book can be easily reproduced 
using the functions mean() (as usual for numeric variables) and prop.table() 
(for factor variables). 


# Mean P/I ratio
mean(HMDA$pirat)
#> [1] 0.3308136

# inhouse expense-to-total-income ratio
mean(HMDA$hirat)
#> [1] 0.2553461

# loan-to-value ratio
mean(HMDA$lvrat)
#> [1] 0.73877759

# consumer credit score
mean(as.numeric(HMDA$chist))
#> [1] 2.116387

# mortgage credit score
mean(as.numeric(HMDA$mhist))
#> [1] 1.721008

# public bad credit record
mean(as.numeric(HMDA$phist) - 1)
#> [1] 0.07352941

# denied mortgage insurance
prop.table(table(HMDA$insurance))
#>
#>         no        yes
#> 0.97983193 0.02016807

# self-employed
prop.table(table(HMDA$selfemp))
#>
#>        no       yes
#> 0.8836134 0.1163866

# single
prop.table(table(HMDA$single))
#>
#>        no       yes
#> 0.6067227 0.3932773

# high school diploma
prop.table(table(HMDA$hschool))
#>
#>         no        yes
#> 0.01638655 0.98361345

# unemployment rate
mean(HMDA$unemp)
#> [1] 3.774496

# condominium
prop.table(table(HMDA$condomin))
#>
#>        no       yes
#> 0.7117647 0.2882353

# black
prop.table(table(HMDA$black))
#>
#>       no      yes
#> 0.857563 0.142437

# deny
prop.table(table(HMDA$deny))
#>
#>         0         1
#> 0.8802521 0.1197479


See Chapter 11.4 of the book or use R’s help function for more on variables 
contained in the HMDA dataset. 


Before estimating the models we transform the loan-to-value ratio (lvrat) into 
a factor variable, where 


$$lvrat = \begin{cases} \text{low} & \text{if } lvrat < 0.8, \\ \text{medium} & \text{if } 0.8 \leq lvrat \leq 0.95, \\ \text{high} & \text{if } lvrat > 0.95 \end{cases}$$


and convert both credit scores to numeric variables. 


# define low, medium and high loan-to-value ratio 

HMDA$lvrat <- factor(
  ifelse(HMDA$lvrat < 0.8, "low",
         ifelse(HMDA$lvrat >= 0.8 & HMDA$lvrat <= 0.95, "medium", "high")),
  levels = c("low", "medium", "high"))


# convert credit scores to numeric 
HMDA$mhist <- as.numeric(HMDA$mhist) 
HMDA$chist <- as.numeric(HMDA$chist) 
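
To see how the nested ifelse() calls treat the threshold values, the same
recoding can be applied to a short vector of made-up loan-to-value ratios. The
vector lv and the object lv_cat are hypothetical and serve only as an
illustration.

```r
# hypothetical loan-to-value ratios, including both threshold values
lv <- c(0.50, 0.79, 0.80, 0.90, 0.95, 0.96, 1.10)

# same recoding rule as used for HMDA$lvrat
lv_cat <- factor(
  ifelse(lv < 0.8, "low",
         ifelse(lv >= 0.8 & lv <= 0.95, "medium", "high")),
  levels = c("low", "medium", "high"))

# original values next to their categories
data.frame(lv, lv_cat)

# the first level serves as the reference category in regressions
levels(lv_cat)
table(lv_cat)
```

Both thresholds, 0.8 and 0.95, end up in the category medium. Setting the
levels argument explicitly makes low the reference category, so the regression
output reports coefficients for lvratmedium and lvrathigh.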


Next we reproduce the estimation results presented in Table 11.2 of the book. 


# estimate all 6 models for the denial probability 
lpm_HMDA <- lm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist 
+ insurance + selfemp, data = HMDA) 


logit_HMDA <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist 
+ insurance + selfemp, 
family = binomial(link = "logit"), 
data = HMDA) 


probit_HMDA_1 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
                     + insurance + selfemp,
                     family = binomial(link = "probit"),
                     data = HMDA)

probit_HMDA_2 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
                     + insurance + selfemp + single + hschool + unemp,
                     family = binomial(link = "probit"),
                     data = HMDA)


probit_HMDA_3 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist 
+ phist + insurance + selfemp + single + hschool + unemp + condomin 
+ I(mhist==3) + I(mhist==4) + I(chist==3) + I(chist==4) + I(chist==5) 
+ I(chist==6) , 
family = binomial(link = "probit"), 
data = HMDA) 


probit_HMDA_4 <- glm(deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist 
+ insurance + selfemp + single + hschool + unemp, 
family = binomial(link = "probit"), 
data = HMDA) 


Just as in previous chapters, we store heteroskedasticity-robust standard errors 
of the coefficient estimators in a list which is then used as the argument se in 
stargazer(). 


rob_se <- list(sqrt(diag(vcovHC(lpm_HMDA, type = "HC1"))),
               sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
               sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1"))))

stargazer(lpm_HMDA, logit_HMDA, probit_HMDA_1,
          probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,
          digits = 3,
          type = "latex",
          header = FALSE,
          se = rob_se,
          model.numbers = FALSE,
          column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)"))


In Table 11.1, models (1), (2) and (3) are baseline specifications that include 
several financial control variables. They differ only in the way they model the 
denial probability. Model (1) is a linear probability model, model (2) is a Logit 
regression and model (3) uses the Probit approach. 


Table 11.1: HMDA Data: LPM, Probit and Logit Models

Dependent variable: deny (robust standard errors in parentheses)

                    OLS                logistic           probit             probit             probit             probit
                    (1)                (2)                (3)                (4)                (5)                (6)
blackyes            0.084*** (0.023)   0.688*** (0.183)   0.389*** (0.099)   0.371*** (0.100)   0.363*** (0.101)   0.246 (0.479)
pirat               0.449*** (0.114)   4.764*** (1.332)   2.442*** (0.673)   2.464*** (0.654)   2.622*** (0.665)   2.572*** (0.728)
hirat               -0.048 (0.110)     -0.109 (1.298)     -0.185 (0.689)     -0.302 (0.689)     -0.502 (0.715)     -0.538 (0.755)
lvratmedium         0.031** (0.013)    0.464*** (0.160)   0.214*** (0.082)   0.216*** (0.082)   0.215** (0.084)    0.216*** (0.083)
lvrathigh           0.189*** (0.050)   1.495*** (0.325)   0.791*** (0.183)   0.795*** (0.184)   0.836*** (0.185)   0.788*** (0.185)
chist               0.031*** (0.005)   0.290*** (0.039)   0.155*** (0.021)   0.158*** (0.021)   0.344*** (0.108)   0.158*** (0.021)
mhist               0.021* (0.011)     0.279** (0.138)    0.148** (0.073)    0.110 (0.076)      0.162 (0.104)      0.111 (0.077)
phistyes            0.197*** (0.035)   1.226*** (0.203)   0.697*** (0.114)   0.702*** (0.115)   0.717*** (0.116)   0.705*** (0.115)
insuranceyes        0.702*** (0.045)   4.548*** (0.576)   2.557*** (0.305)   2.585*** (0.299)   2.589*** (0.306)   2.590*** (0.299)
selfempyes          0.060*** (0.021)   0.666*** (0.214)   0.359*** (0.113)   0.346*** (0.116)   0.342*** (0.116)   0.348*** (0.116)
singleyes                                                                    0.229*** (0.080)   0.230*** (0.086)   0.226*** (0.081)
hschoolyes                                                                   -0.613*** (0.229)  -0.604** (0.237)   -0.620*** (0.229)
unemp                                                                        0.030* (0.018)     0.028 (0.018)      0.030 (0.018)
condominyes                                                                                     -0.055 (0.096)
I(mhist == 3)                                                                                   -0.107 (0.301)
I(mhist == 4)                                                                                   -0.383 (0.427)
I(chist == 3)                                                                                   -0.226 (0.248)
I(chist == 4)                                                                                   -0.251 (0.338)
I(chist == 5)                                                                                   -0.789* (0.412)
I(chist == 6)                                                                                   -0.905* (0.515)
blackyes:pirat                                                                                                     -0.579 (1.550)
blackyes:hirat                                                                                                     1.232 (1.709)
Constant            -0.183*** (0.028)  -5.707*** (0.484)  -3.041*** (0.250)  -2.575*** (0.350)  -2.896*** (0.404)  -2.543*** (0.370)

Observations        2,380              2,380              2,380              2,380              2,380              2,380
R²                  0.266
Adjusted R²         0.263
Log Likelihood                         -635.637           -636.847           -628.614           -625.064           -628.332
Akaike Inf. Crit.                      1,293.273          1,295.694          1,285.227          1,292.129          1,288.664
Residual Std. Error 0.279 (df = 2369)
F Statistic         85.974*** (df = 10; 2369)

Note: *p<0.1; **p<0.05; ***p<0.01


In the linear model (1), the coefficients have a direct interpretation. For
example, an increase in the consumer credit score by 1 unit is estimated to
increase the probability of a loan denial by about 3.1 percentage points. Having
a high loan-to-value ratio is detrimental to credit approval: the coefficient for
a loan-to-value ratio higher than 0.95 is 0.189, so clients with this property are
estimated to face a denial probability that is almost 19 percentage points higher
than for those with a low loan-to-value ratio, ceteris paribus. The estimated
coefficient on the race dummy is 0.084, which indicates that the denial
probability for African Americans is 8.4 percentage points higher than for white
applicants with the same characteristics except for race. Apart from the
housing-expense-to-income ratio and the mortgage credit score, all coefficients
are significant at the 5% level.


Models (2) and (3) provide similar evidence that there is racial discrimination
in the U.S. mortgage market. All coefficients except for the housing
expense-to-income ratio (which is not significantly different from zero) are
significant at the 5% level. As discussed above, the nonlinearity makes the
interpretation of the coefficient estimates more difficult than for model (1). In
order to make a statement about the effect of being black, we need to compute
the estimated denial probability for two individuals that differ only in race.
For the comparison we consider two individuals that share mean values for all
numeric regressors. For the qualitative variables we assign the property that is
most representative for the data at hand. For example, consider self-employment:
we have seen that about 88% of all individuals in the sample are not
self-employed such that we set selfemp = no. Using this approach, the estimate
for the effect on the denial probability of being African American of the Logit
model (2) is about 4 percentage points. The next code chunk shows how to apply
this approach for models (1) to (6) using R.


# compute regressor values for an average applicant, once white and once black
new <- data.frame(
  "pirat" = mean(HMDA$pirat),
  "hirat" = mean(HMDA$hirat),
  "lvrat" = "low",
  "chist" = mean(HMDA$chist),
  "mhist" = mean(HMDA$mhist),
  "phist" = "no",
  "insurance" = "no",
  "selfemp" = "no",
  "black" = c("no", "yes"),
  "single" = "no",
  "hschool" = "yes",
  "unemp" = mean(HMDA$unemp),
  "condomin" = "no")

# difference predicted by the LPM
predictions <- predict(lpm_HMDA, newdata = new)
diff(predictions)
#>          2
#> 0.08369674

# difference predicted by the logit model
predictions <- predict(logit_HMDA, newdata = new, type = "response")
diff(predictions)
#>          2
#> 0.04042135

# difference predicted by probit model (3)
predictions <- predict(probit_HMDA_1, newdata = new, type = "response")
diff(predictions)
#>          2
#> 0.05049716

# difference predicted by probit model (4)
predictions <- predict(probit_HMDA_2, newdata = new, type = "response")
diff(predictions)
#>          2
#> 0.03978918

# difference predicted by probit model (5)
predictions <- predict(probit_HMDA_3, newdata = new, type = "response")
diff(predictions)
#>          2
#> 0.04972468

# difference predicted by probit model (6)
predictions <- predict(probit_HMDA_4, newdata = new, type = "response")
diff(predictions)
#>          2
#> 0.03955893


The estimates of the impact on the denial probability of being black are similar 
for models (2) and (3). It is interesting that the magnitude of the estimated 
effects is much smaller than for Probit and Logit models that do not control 
for financial characteristics (see section 11.2). This indicates that these simple 
models produce biased estimates due to omitted variables. 


Regressions (4) to (6) use regression specifications that include different appli- 
cant characteristics and credit rating indicator variables as well as interactions. 
However, most of the corresponding coefficients are not significant and the es- 
timates of the coefficient on black obtained for these models as well as the 
estimated difference in denial probabilities do not differ much from those ob- 
tained for the similar specifications (2) and (3). 


An interesting question related to racial discrimination can be investigated
using the Probit model (6) where the interactions blackyes:pirat and
blackyes:hirat are added to model (4). If the coefficient on blackyes:pirat
were different from zero, the effect of the payment-to-income ratio on the denial
probability would be different for black and white applicants. Similarly, a
non-zero coefficient on blackyes:hirat would indicate that loan officers weight
the risk associated with a high housing expense-to-income ratio differently for
black and white mortgage applicants. We can test whether these coefficients
are jointly significant at the 5% level using an F-Test.


linearHypothesis(probit_HMDA_4,
                 test = "F",
                 c("blackyes:pirat=0", "blackyes:hirat=0"),
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#>
#> Hypothesis:
#> blackyes:pirat = 0
#> blackyes:hirat = 0
#>
#> Model 1: restricted model
#> Model 2: deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist +
#>     insurance + selfemp + single + hschool + unemp
#>
#> Note: Coefficient covariance matrix supplied.
#>
#>   Res.Df Df      F Pr(>F)
#> 1   2366
#> 2   2364  2 0.2073 0.7809


Since the p-value of this test is about 0.78, the null cannot be rejected.
Nonetheless, we can reject the hypothesis that there is no racial discrimination
at all since the corresponding F-test has a p-value of about 0.002.


linearHypothesis(probit_HMDA_4,
                 test = "F",
                 c("blackyes=0", "blackyes:pirat=0", "blackyes:hirat=0"),
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#>
#> Hypothesis:
#> blackyes = 0
#> blackyes:pirat = 0
#> blackyes:hirat = 0
#>
#> Model 1: restricted model
#> Model 2: deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist +
#>     insurance + selfemp + single + hschool + unemp
#>
#> Note: Coefficient covariance matrix supplied.
#>
#>   Res.Df Df      F   Pr(>F)
#> 1   2367
#> 2   2364  3 4.7774 0.002534 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


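
For intuition, the statistic behind such a joint test can be sketched by hand
as a Wald test. The example below uses simulated data and, as a simplifying
assumption, the non-robust covariance matrix from vcov() instead of vcovHC(),
so it illustrates the mechanics only and does not reproduce the numbers above.
All object names are hypothetical.

```r
set.seed(5)
n <- 2000

# simulated data (assumption): group dummy d, regressor x, no true interaction
d <- rbinom(n, size = 1, prob = 0.3)
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = pnorm(-1 + 0.5 * d + 0.8 * x))

# Probit model with an interaction between d and x
fit <- glm(y ~ d * x, family = binomial(link = "probit"))

# coefficients under test and the matching block of the covariance matrix
idx <- c("d", "d:x")
b <- coef(fit)[idx]
V <- vcov(fit)[idx, idx]   # non-robust here; the text uses vcovHC()

# Wald statistic for H0: both coefficients are zero, and its F version
W <- drop(t(b) %*% solve(V) %*% b)
q <- length(idx)
F_stat <- W / q

# p-values from the chi-squared and the F distribution
p_chisq <- pchisq(W, df = q, lower.tail = FALSE)
p_F <- pf(F_stat, df1 = q, df2 = df.residual(fit), lower.tail = FALSE)
c(W = W, p_chisq = p_chisq, F = F_stat, p_F = p_F)
```

With a large number of residual degrees of freedom, the two p-values are
practically identical.
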
Summary 


Models (1) to (6) provide evidence that there is an effect of being African
American on the probability of a mortgage application denial: in all
specifications, the effect is estimated to be positive (about 4 to 5 percentage
points in the Logit and Probit models) and is significantly different from zero
at the 1% level. While the linear probability model seems to slightly
overestimate this effect, it still can be used as an approximation to an
intrinsically nonlinear relationship.


See Chapters 11.4 and 11.5 of the book for a discussion of external and internal 
validity of this study and some concluding remarks on regression models where 
the dependent variable is binary. 


11.5 Exercises 


This interactive part of the book is only available in the HTML version. 


Chapter 12 


Instrumental Variables Regression


As discussed in Chapter 9, regression models may suffer from problems like
omitted variables, measurement errors and simultaneous causality. If so, the error
term is correlated with the regressor of interest so that the corresponding
coefficient is estimated inconsistently. So far we have assumed that we can add
the omitted variables to the regression to mitigate the risk of biased estimation 
of the causal effect of interest. However, if omitted factors cannot be measured 
or are not available for other reasons, multiple regression cannot solve the prob- 
lem. The same issue arises if there is simultaneous causality. When causality 
runs from X to Y and vice versa, there will be an estimation bias that cannot 
be corrected for by multiple regression. 


A general technique for obtaining a consistent estimator of the coefficient of 
interest is instrumental variables (IV) regression. In this chapter we focus on the 
IV regression tool called two-stage least squares (TSLS). The first sections briefly 
recap the general mechanics and assumptions of IV regression and show how 
to perform TSLS estimation using R. Next, IV regression is used for estimating 
the elasticity of the demand for cigarettes — a classical example where multiple 
regression fails to do the job because of simultaneous causality. 


Just like for the previous chapter, the packages AER (Kleiber and Zeileis, 2020) 
and stargazer (Hlavac, 2018) are required for reproducing the code presented 
in this chapter. Check whether the code chunk below executes without any 
error messages. 


library(AER)
library(stargazer)


12.1 The IV Estimator with a Single Regressor and a Single Instrument


Consider the simple regression model 


$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \dots, n \tag{12.1}$$


where the error term $u_i$ is correlated with the regressor $X_i$ ($X$ is endogenous)
such that OLS is inconsistent for the true $\beta_1$. In the most simple case, IV
regression uses a single instrumental variable $Z$ to obtain a consistent estimator
for $\beta_1$.

$Z$ must satisfy two conditions to be a valid instrument:

1. Instrument relevance condition:
   $X$ and its instrument $Z$ must be correlated: $\rho_{Z_i, X_i} \neq 0$.

2. Instrument exogeneity condition:
   The instrument $Z$ must not be correlated with the error term $u$: $\rho_{Z_i, u_i} = 0$.


The Two-Stage Least Squares Estimator 


As can be guessed from its name, TSLS proceeds in two stages. In the first stage, 
the variation in the endogenous regressor X is decomposed into a problem-free 
component that is explained by the instrument Z and a problematic component 
that is correlated with the error $u_i$. The second stage uses the problem-free
component of the variation in $X$ to estimate $\beta_1$.


The first stage regression model is 
$$X_i = \pi_0 + \pi_1 Z_i + \nu_i,$$

where $\pi_0 + \pi_1 Z_i$ is the component of $X_i$ that is explained by $Z_i$ while $\nu_i$ is the
component that cannot be explained by $Z_i$ and exhibits correlation with $u_i$.


Using the OLS estimates $\widehat{\pi}_0$ and $\widehat{\pi}_1$ we obtain predicted values $\widehat{X}_i$, $i = 1, \dots, n$.
If $Z$ is a valid instrument, the $\widehat{X}_i$ are problem-free in the sense that $\widehat{X}$ is
exogenous in a regression of $Y$ on $\widehat{X}$ which is done in the second stage regression.
The second stage produces $\widehat{\beta}_0^{TSLS}$ and $\widehat{\beta}_1^{TSLS}$, the TSLS estimates of $\beta_0$ and
$\beta_1$.


For the case of a single instrument one can show that the TSLS estimator of $\beta_1$
is

$$\widehat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} = \frac{\frac{1}{n-1} \sum_{i=1}^n (Y_i - \overline{Y})(Z_i - \overline{Z})}{\frac{1}{n-1} \sum_{i=1}^n (X_i - \overline{X})(Z_i - \overline{Z})}, \tag{12.2}$$

which is nothing but the ratio of the sample covariance between $Z$ and $Y$ to the
sample covariance between $Z$ and $X$.
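
The two stages and the covariance ratio in (12.2) can be checked against each
other in a small simulation. The data generating process below (a true
coefficient of 1.5, an instrument z and an unobserved factor e that makes x
endogenous) is an assumption made for this sketch, as are all object names.

```r
set.seed(10)
n <- 5000

# simulated data (assumed design): z is a valid instrument, x is endogenous
z <- rnorm(n)
e <- rnorm(n)                      # unobserved factor driving both x and u
x <- 1 + 0.8 * z + e + rnorm(n)
u <- 2 * e + rnorm(n)              # error term, correlated with x through e
y <- 3 + 1.5 * x + u               # true beta_1 = 1.5

# OLS is inconsistent because x and u are correlated
beta_ols <- unname(coef(lm(y ~ x))[2])

# first stage: regress x on z and keep the fitted values
x_hat <- fitted(lm(x ~ z))

# second stage: regress y on the fitted values from the first stage
beta_tsls <- unname(coef(lm(y ~ x_hat))[2])

# ratio of sample covariances as in (12.2)
beta_ratio <- cov(z, y) / cov(z, x)

c(OLS = beta_ols, TSLS = beta_tsls, ratio = beta_ratio)
```

The second-stage coefficient and the covariance ratio coincide, while the OLS
estimate is off because of the endogeneity. Note that the standard errors
reported by lm() for the manual second stage are not valid since they ignore
that x_hat is itself estimated.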


As shown in Appendix 12.3 of the book, (12.2) is a consistent estimator for $\beta_1$ in
(12.1) under the assumption that $Z$ is a valid instrument. Just as for every other
OLS estimator we have considered so far, the CLT implies that the distribution
of $\widehat{\beta}_1^{TSLS}$ can be approximated by a normal distribution if the sample size is
large. This allows us to use t-statistics and confidence intervals which are also
computed by certain R functions. A more detailed argument on the large-sample
distribution of the TSLS estimator is sketched in Appendix 12.3 of the book.


Application to the Demand For Cigarettes 


The relation between the demand for and the price of commodities is a simple 
yet widespread problem in economics. Health economics is concerned with the 
study of how health-affecting behavior of individuals is influenced by the health- 
care system and regulation policy. Probably the most prominent example in 
public policy debates is smoking as it is related to many illnesses and negative 
externalities. 


It is plausible that cigarette consumption can be reduced by taxing cigarettes 
more heavily. The question is by how much taxes must be increased to reach 
a certain reduction in cigarette consumption. Economists use elasticities to an- 
swer this kind of question. Since the price elasticity for the demand of cigarettes 
is unknown, it must be estimated. As discussed in the box Who Invented In- 
strumental Variables Regression presented in Chapter 12.1 of the book, an OLS 
regression of log quantity on log price cannot be used to estimate the effect 
of interest since there is simultaneous causality between demand and supply. 
Instead, IV regression can be used. 


We use the data set CigarettesSW which comes with the package AER. It is a
panel data set that contains observations on cigarette consumption and several
economic indicators for all 48 continental federal states of the U.S. for the
years 1985 and 1995. Following the book we consider data for the cross section
of states in 1995 only.


We start by loading the package, attaching the data set and getting an overview. 


# load the data set and get an overview
library(AER)
data("CigarettesSW")
summary(CigarettesSW)
#>      state      year         cpi          population           packs
#>  AL     : 2   1985:48   Min.   :1.076   Min.   :  478447   Min.   : 49.27
#>  AR     : 2   1995:48   1st Qu.:1.076   1st Qu.: 1622606   1st Qu.: 92.45
#>  AZ     : 2             Median :1.300   Median : 3697472   Median :110.16
#>  CA     : 2             Mean   :1.300   Mean   : 5168866   Mean   :109.18
#>  CO     : 2             3rd Qu.:1.524   3rd Qu.: 5901500   3rd Qu.:123.52
#>  CT     : 2             Max.   :1.524   Max.   :31493524   Max.   :197.99
#>  (Other):84
#>      income               tax            price             taxs
#>  Min.   :  6887097   Min.   :18.00   Min.   : 84.97   Min.   : 21.27
#>  1st Qu.: 25520384   1st Qu.:31.00   1st Qu.:102.71   1st Qu.: 34.77
#>  Median : 61661644   Median :37.00   Median :137.72   Median : 41.05
#>  Mean   : 99878736   Mean   :42.68   Mean   :143.45   Mean   : 48.33
#>  3rd Qu.:127313964   3rd Qu.:50.88   3rd Qu.:176.15   3rd Qu.: 59.48
#>  Max.   :771470144   Max.   :99.00   Max.   :240.85   Max.   :112.63


Use ?CigarettesSW for a detailed description of the variables.


We are interested in estimating $\beta_1$ in

$$\log(Q_i^{cigarettes}) = \beta_0 + \beta_1 \log(P_i^{cigarettes}) + u_i, \tag{12.3}$$

where $Q_i^{cigarettes}$ is the number of cigarette packs per capita sold and $P_i^{cigarettes}$
is the after-tax average real price per pack of cigarettes in state $i$.


The instrumental variable we are going to use for instrumenting the endogenous
regressor $\log(P_i^{cigarettes})$ is $SalesTax$, the portion of taxes on
cigarettes arising from the general sales tax. $SalesTax$ is measured in dollars
per pack. The idea is that $SalesTax$ is a relevant instrument as it is included
in the after-tax average price per pack. Also, it is plausible that $SalesTax$
is exogenous since the sales tax does not influence quantity sold directly but
indirectly through the price.
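
The role of the two conditions can be illustrated with a small simulation. The following is only a minimal sketch on artificial data, not on the cigarette data: all variable names, coefficients and the sample size are assumptions chosen for illustration. The regressor x is correlated with the error u by construction, while the instrument z is correlated with x but not with u.

```r
# simulated example (all names and parameter values are illustrative assumptions)
set.seed(123)
n <- 10000
z <- rnorm(n)              # instrument: relevant and exogenous by construction
u <- rnorm(n)              # structural error term
x <- z + u + rnorm(n)      # endogenous regressor, correlated with u
y <- 1 + 2 * x + u         # the true slope is 2

# OLS slope: inconsistent here since cov(x, u) != 0
ols_slope <- unname(coef(lm(y ~ x))[2])

# IV slope with a single instrument: cov(z, y) / cov(z, x)
iv_slope <- cov(z, y) / cov(z, x)

c(OLS = ols_slope, IV = iv_slope)
```

In this setup the probability limit of the OLS slope is 2 + 1/3 because cov(x, u)/var(x) = 1/3, whereas the IV estimate should be close to the true value of 2.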


We perform some transformations in order to obtain deflated cross section data 
for the year 1995. 


We also compute the sample correlation between the sales tax and the price per
pack. The sample correlation is a consistent estimator of the population
correlation. The estimate of approximately 0.614 indicates that $SalesTax$ and
$P_i^{cigarettes}$ exhibit positive correlation which meets our expectations:
higher sales taxes lead to higher prices. However, a correlation analysis like
this is not sufficient for checking whether the instrument is relevant. We will
later come back to the issue of checking whether an instrument is relevant and
exogenous.

# compute real per capita prices
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)

# compute the sales tax
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)

# check the correlation between sales tax and price
cor(CigarettesSW$salestax, CigarettesSW$price)
#> [1] 0.6141228

# generate a subset for the year 1995
c1995 <- subset(CigarettesSW, year == "1995")


The first stage regression is

$$\log(P_i^{cigarettes}) = \pi_0 + \pi_1 SalesTax_i + \nu_i.$$

We estimate this model in R using lm(). In the second stage we run a regression
of $\log(Q_i^{cigarettes})$ on $\widehat{\log(P_i^{cigarettes})}$ to obtain
$\widehat{\beta}_0^{TSLS}$ and $\widehat{\beta}_1^{TSLS}$.


# perform the first stage regression
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)

coeftest(cig_s1, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept) 4.6165463  0.0289177 159.6444 < 2.2e-16 ***
#> salestax    0.0307289  0.0048354   6.3549 8.489e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated first stage regression is

$$\widehat{\log(P_i^{cigarettes})} = \underset{(0.03)}{4.62} + \underset{(0.005)}{0.031} SalesTax_i$$

which predicts a positive relation between the sales tax and the price per pack
of cigarettes. How much of the observed variation in $\log(P_i^{cigarettes})$ is
explained by the instrument $SalesTax$? This can be answered by looking at the
regression's $R^2$ which states that about 47% of the variation in after-tax
prices is explained by the variation of the sales tax across states.

# inspect the R^2 of the first stage regression
summary(cig_s1)$r.squared
#> [1] 0.4709961


We next store $\widehat{\log(P_i^{cigarettes})}$, the fitted values obtained by
the first stage regression cig_s1, in the variable lcigp_pred.


# store the predicted values
lcigp_pred <- cig_s1$fitted.values


Next, we run the second stage regression which gives us the TSLS estimates we 
seek. 


# run the stage 2 regression
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)
coeftest(cig_s2, vcov = vcovHC)
#> 
#> t test of coefficients:
#> 
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)  9.71988    1.70304  5.7074 7.932e-07 ***
#> lcigp_pred  -1.08359    0.35563 -3.0469  0.003822 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Thus estimating the model (12.3) using TSLS yields

$$\widehat{\log(Q_i^{cigarettes})} = \underset{(1.70)}{9.72} - \underset{(0.36)}{1.08} \log(P_i^{cigarettes}), \tag{12.4}$$

where we write $\log(P_i^{cigarettes})$ instead of
$\widehat{\log(P_i^{cigarettes})}$ for consistency with the book.


The function ivreg() from the package AER carries out the TSLS procedure
automatically. It is used similarly to lm(). Instruments can be added to the
usual specification of the regression formula using a vertical bar separating
the model equation from the instruments. Thus, for the regression at hand the
correct formula is log(packs) ~ log(rprice) | salestax.


# perform TSLS using 'ivreg()'
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)

coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)  9.71988    1.52832  6.3598 8.346e-08 ***
#> log(rprice) -1.08359    0.31892 -3.3977  0.001411 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We find that the coefficient estimates coincide for both approaches. 


Two Notes on the Computation of TSLS Standard Errors 


1. We have demonstrated that running the individual regressions for each
stage of TSLS using lm() leads to the same coefficient estimates as when
using ivreg(). However, the standard errors reported for the second-stage
regression, e.g., by coeftest() or summary(), are invalid: neither
adjusts for using predictions from the first-stage regression as regressors
in the second-stage regression. Fortunately, ivreg() performs the necessary
adjustment automatically. This is another advantage over the manual
step-by-step estimation which we have done above for demonstrating the
mechanics of the procedure.


2. Just like in multiple regression it is important to compute heteroskedasticity- 
robust standard errors as we have done above using vcovHC(). 
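
The first note can be made concrete with a short simulation. This is a hedged sketch on artificial data (names and parameter values are assumptions), and for simplicity it uses the homoskedasticity-only formula rather than the robust one: the only difference between the naive and the corrected standard error is whether the residuals are computed with the fitted values or with the observed regressor.

```r
# simulated data (illustrative assumptions, not the cigarette data)
set.seed(1)
n <- 500
z <- rnorm(n)
u <- rnorm(n)
x <- 0.5 * z + u + rnorm(n)   # endogenous regressor
y <- 1 + 2 * x + u

# manual two-stage procedure
x_hat <- fitted(lm(x ~ z))
stage2 <- lm(y ~ x_hat)
b <- unname(coef(stage2))

# naive residuals are based on x_hat, TSLS residuals on the observed x
res_naive <- residuals(stage2)
res_tsls <- y - b[1] - b[2] * x

# homoskedasticity-only standard errors of the slope
ssx_hat <- sum((x_hat - mean(x_hat))^2)
se_naive <- sqrt(sum(res_naive^2) / (n - 2) / ssx_hat)
se_tsls <- sqrt(sum(res_tsls^2) / (n - 2) / ssx_hat)

c(naive = se_naive, corrected = se_tsls)
```

The naive value is what summary(stage2) would report for the slope. The two numbers generally differ, which is why the second-stage output of lm() should not be used for inference.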


The TSLS estimate for $\beta_1$ in (12.4) suggests that an increase in cigarette
prices by one percent reduces cigarette consumption by roughly 1.08 percent,
which is fairly elastic. However, we should keep in mind that this estimate might
not be trustworthy even though we used IV estimation: there still might be a
bias due to omitted variables. Thus a multiple IV regression approach is needed.


12.2 The General IV Regression Model 


The simple IV regression model is easily extended to a multiple regression model 
which we refer to as the general IV regression model. In this model we distin- 
guish between four types of variables: the dependent variable, included exoge- 
nous variables, included endogenous variables and instrumental variables. Key 
Concept 12.1 summarizes the model and the common terminology. See Chapter 
12.2 of the book for a more comprehensive discussion of the individual compo- 
nents of the general model. 

Key Concept 12.1
The General Instrumental Variables Regression Model and Terminology


$$Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \dots + \beta_{k+r} W_{ri} + u_i, \tag{12.5}$$

with $i = 1, \dots, n$ is the general instrumental variables regression model
where

• $Y_i$ is the dependent variable

• $\beta_0, \dots, \beta_{k+r}$ are $1 + k + r$ unknown regression coefficients

• $X_{1i}, \dots, X_{ki}$ are $k$ endogenous regressors

• $W_{1i}, \dots, W_{ri}$ are $r$ exogenous regressors which are uncorrelated
with $u_i$

• $u_i$ is the error term

• $Z_{1i}, \dots, Z_{mi}$ are $m$ instrumental variables


The coefficients are overidentified if $m > k$. If $m < k$, the coefficients
are underidentified and when $m = k$ they are exactly identified. For
estimation of the IV regression model we require exact identification or
overidentification.





While computing both stages of TSLS individually is not a big deal in (12.1),
the simple regression model with a single endogenous regressor, Key Concept
12.2 clarifies why resorting to TSLS functions like ivreg() is more convenient
when the set of potentially endogenous regressors (and instruments) is large.


Estimating regression models with TSLS using multiple instruments by means 
of ivreg() is straightforward. There are, however, some subtleties in correctly 
specifying the regression formula. 


Assume that you want to estimate the model

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 W_{1i} + u_i$$

where $X_{1i}$ and $X_{2i}$ are endogenous regressors that shall be instrumented
by $Z_{1i}$, $Z_{2i}$ and $Z_{3i}$ and $W_{1i}$ is an exogenous regressor. The
corresponding data is available in a data.frame with column names y, x1, x2, w1,
z1, z2 and z3. It might be tempting to specify the argument formula in your call
of ivreg() as y ~ x1 + x2 + w1 | z1 + z2 + z3 which is wrong. As explained in
the documentation of ivreg() (see ?ivreg), it is necessary to list all exogenous
variables as instruments too, that is joining them by +'s on the right of the
vertical bar: y ~ x1 + x2 + w1 | w1 + z1 + z2 + z3 where w1 is “instrumenting
itself”.


If there is a large number of exogenous variables it may be convenient to provide
an update formula with a . (this includes all variables except for the dependent
variable) right after the | and to exclude all endogenous variables using a -. For
example, if there is one exogenous regressor w1 and one endogenous regressor
x1 with instrument z1, the appropriate formula would be y ~ w1 + x1 | w1
+ z1 which is equivalent to y ~ w1 + x1 | . - x1 + z1.
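
The equivalence of the two ways of writing the formula can be checked on artificial data. The sketch below assumes a simulated data frame whose column names follow the text (y, x1, w1, z1). The data generating process and the object names dat, m_explicit and m_update are hypothetical.

```r
library(AER)

# simulated data frame (column names follow the text; values are assumptions)
set.seed(10)
n <- 300
w1 <- rnorm(n)
z1 <- rnorm(n)
u <- rnorm(n)
x1 <- 0.7 * z1 + 0.4 * w1 + 0.6 * u + rnorm(n)
dat <- data.frame(y = 1 + 2 * x1 - w1 + u, x1 = x1, w1 = w1, z1 = z1)

# explicit instrument list: w1 instruments itself
m_explicit <- ivreg(y ~ w1 + x1 | w1 + z1, data = dat)

# update-style formula: all regressors, minus x1, plus z1
m_update <- ivreg(y ~ w1 + x1 | . - x1 + z1, data = dat)

cbind(explicit = coef(m_explicit), update = coef(m_update))
```

Both calls should return identical coefficient estimates, since they describe the same set of instruments.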


Key Concept 12.2
Two-Stage Least Squares


Similarly to the simple IV regression model, the general IV model (12.5)
can be estimated using the two-stage least squares estimator:


• First-stage regression(s)
Run an OLS regression for each of the endogenous variables
($X_{1i}, \dots, X_{ki}$) on all instrumental variables
($Z_{1i}, \dots, Z_{mi}$), all exogenous variables ($W_{1i}, \dots, W_{ri}$)
and an intercept. Compute the fitted values
($\widehat{X}_{1i}, \dots, \widehat{X}_{ki}$).


• Second-stage regression
Regress the dependent variable on the predicted values of all endogenous
regressors, all exogenous variables and an intercept using OLS. This gives
$\widehat{\beta}_0^{TSLS}, \dots, \widehat{\beta}_{k+r}^{TSLS}$, the TSLS
estimates of the model coefficients.
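
The two stages of Key Concept 12.2 can be sketched in base R and compared with the closed-form matrix expression of the TSLS estimator, $(X'P_Z X)^{-1} X'P_Z y$ with $P_Z$ the projection matrix on the instruments. The data below are simulated. All names and coefficient values are assumptions made for illustration.

```r
# simulated data: one endogenous regressor, one exogenous regressor, two instruments
set.seed(42)
n <- 400
w1 <- rnorm(n)
z1 <- rnorm(n)
z2 <- rnorm(n)
u <- rnorm(n)
x1 <- 0.8 * z1 - 0.5 * z2 + 0.3 * w1 + 0.7 * u + rnorm(n)
y <- 1 + 2 * x1 - 1 * w1 + u

# first stage: endogenous regressor on all instruments, exogenous regressors, intercept
stage1 <- lm(x1 ~ z1 + z2 + w1)
x1_hat <- fitted(stage1)

# second stage: dependent variable on fitted values and exogenous regressors
stage2 <- lm(y ~ x1_hat + w1)

# matrix version: regressor matrix X and instrument matrix Z (w1 appears in both)
X <- cbind(1, x1, w1)
Z <- cbind(1, z1, z2, w1)
PZX <- Z %*% solve(crossprod(Z), crossprod(Z, X))         # projection of X on Z
beta_tsls <- solve(crossprod(PZX, X), crossprod(PZX, y))  # (X'P_Z X)^{-1} X'P_Z y

cbind(two_stage = coef(stage2), matrix_formula = as.numeric(beta_tsls))
```

Both columns should agree up to numerical precision. Note that the exogenous regressor w1 enters the first stage as well, as required by the key concept.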





In the general IV regression model, the instrument relevance and instrument 
exogeneity assumptions are the same as in the simple regression model with a 
single endogenous regressor and only one instrument. See Key Concept 12.3 for 
a recap using the terminology of general IV regression. 

Key Concept 12.3
Two Conditions for Valid Instruments


For $Z_{1i}, \dots, Z_{mi}$ to be a set of valid instruments, the following two
conditions must be fulfilled:


1. Instrument Relevance
If there are $k$ endogenous variables, $r$ exogenous variables and $m \geq k$
instruments $Z$ and the $\widehat{X}_{1i}^*, \dots, \widehat{X}_{ki}^*$ are the
predicted values from the $k$ population first stage regressions, it must hold
that

$$(\widehat{X}_{1i}^*, \dots, \widehat{X}_{ki}^*, W_{1i}, \dots, W_{ri}, 1)$$

are not perfectly multicollinear. $1$ denotes the constant regressor
which equals 1 for all observations.


Note: If there is only one endogenous regressor $X_i$, there must
be at least one non-zero coefficient on the $Z$ and the $W$ in the
population regression for this condition to be valid: if all of the
coefficients are zero, all the $\widehat{X}_i^*$ are just the mean of $X$ such
that there is perfect multicollinearity.


2. Instrument Exogeneity
All $m$ instruments must be uncorrelated with the error term,

$$\rho_{Z_{1i}, u_i} = 0, \dots, \rho_{Z_{mi}, u_i} = 0.$$





One can show that if the IV regression assumptions presented in Key Concept 
12.4 hold, the TSLS estimator in (12.5) is consistent and normally distributed 
when the sample size is large. Appendix 12.3 of the book deals with a proof in 
the special case with a single regressor, a single instrument and no exogenous 
variables. The reasoning behind this carries over to the general IV model. 
Chapter 18 of the book provides a more involved proof for the general
case.


For our purposes it is sufficient to bear in mind that validity of the assumptions 
stated in Key Concept 12.4 allows us to obtain valid statistical inference using 
R functions which compute t-Tests, F-Tests and confidence intervals for model 
coefficients. 

Key Concept 12.4
The IV Regression Assumptions


For the general IV regression model in Key Concept 12.1 we assume the
following:


1. $E(u_i \vert W_{1i}, \dots, W_{ri}) = 0$.


2. $(X_{1i}, \dots, X_{ki}, W_{1i}, \dots, W_{ri}, Z_{1i}, \dots, Z_{mi})$ are
i.i.d. draws from their joint distribution.


3. All variables have nonzero finite fourth moments, i.e., outliers are
unlikely.


4. The $Z$s are valid instruments (see Key Concept 12.3).





Application to the Demand for Cigarettes 


The estimated elasticity of the demand for cigarettes in (12.4) is -1.08. Although
(12.4) was estimated using IV regression it is plausible that this IV estimate is
biased: in this model, the TSLS estimator is inconsistent for the true $\beta_1$
if the instrument (the real sales tax per pack) correlates with the error term.
This is likely to be the case since there are economic factors, like state
income, which impact the demand for cigarettes and correlate with the sales tax.
States with high personal income tend to generate tax revenues by income taxes
and less by sales taxes. Consequently, state income should be included in the
regression model.


$$\log(Q_i^{cigarettes}) = \beta_0 + \beta_1 \log(P_i^{cigarettes}) + \beta_2 \log(income_i) + u_i \tag{12.6}$$


Before estimating (12.6) using ivreg() we define income as real per capita 
income rincome and append it to the data set CigarettesSW. 


# add rincome to the dataset
CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi)

c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) +
                    salestax, data = c1995)

coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)   9.43066    1.25939  7.4883 1.935e-09 ***
#> log(rprice)  -1.14338    0.37230 -3.0711  0.003611 ** 
#> log(rincome)  0.21452    0.31175  0.6881  0.494917    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We obtain

$$\widehat{\log(Q_i^{cigarettes})} = \underset{(1.26)}{9.43} - \underset{(0.37)}{1.14} \log(P_i^{cigarettes}) + \underset{(0.31)}{0.21} \log(income_i). \tag{12.7}$$


Following the book we add the cigarette-specific taxes ($cigtax_i$) as a further
instrumental variable and estimate again using TSLS.


# add cigtax to the data set
CigarettesSW$cigtax <- with(CigarettesSW, tax / cpi)

c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) |
                    log(rincome) + salestax + cigtax, data = c1995)

coeftest(cig_ivreg3, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)   9.89496    0.95922 10.3157 1.947e-13 ***
#> log(rprice)  -1.27742    0.24961 -5.1177 6.211e-06 ***
#> log(rincome)  0.28040    0.25389  1.1044    0.2753    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Using the two instruments $salestax_i$ and $cigtax_i$ we have $m = 2$ and $k = 1$
so the coefficient on the endogenous regressor $\log(P_i^{cigarettes})$ is
overidentified. The TSLS estimate of (12.6) is

$$\widehat{\log(Q_i^{cigarettes})} = \underset{(0.96)}{9.89} - \underset{(0.25)}{1.28} \log(P_i^{cigarettes}) + \underset{(0.25)}{0.28} \log(income_i). \tag{12.8}$$

Should we trust the estimates presented in (12.7) or rather rely on (12.8)? The
estimates obtained using both instruments are more precise since in (12.8) all
standard errors reported are smaller than in (12.7). In fact, the standard error
for the estimate of the demand elasticity is only two thirds of the standard error
when the sales tax is the only instrument used. This is due to more information
being used in estimation (12.8). If the instruments are valid, (12.8) can be
considered more reliable.


However, without insights regarding the validity of the instruments it is not
sensible to make such a statement. This stresses why checking instrument validity
is essential. Chapter 12.3 briefly discusses guidelines for checking instrument
validity and presents approaches that allow us to test for instrument relevance
and exogeneity under certain conditions. These are then used in an application to
the demand for cigarettes in Chapter 12.4.


12.3 Checking Instrument Validity 


Instrument Relevance 


Instruments that explain little variation in the endogenous regressor X are called
weak instruments. Weak instruments provide little information about the
variation in X that is exploited by IV regression to estimate the effect of
interest: the coefficient on the endogenous regressor is estimated
inaccurately. Moreover, weak instruments cause the distribution of the estimator
to deviate considerably from a normal distribution even in large samples such
that the usual methods for obtaining inference about the true coefficient on X
may produce wrong results. See Chapter 12.3 and Appendix 12.4 of the book
for a more detailed argument on the undesirable consequences of using weak
instruments in IV regression.

Key Concept 12.5
A Rule of Thumb for Checking for Weak Instruments


Consider the case of a single endogenous regressor $X$ and $m$ instruments
$Z_1, \dots, Z_m$. If the coefficients on all instruments in the population
first-stage regression of a TSLS estimation are zero, the instruments
do not explain any of the variation in $X$, which clearly violates
assumption 1 of Key Concept 12.3. Although the latter case is unlikely
to be encountered in practice, we should ask ourselves to what extent
the assumption of instrument relevance should be fulfilled.


While this is hard to answer for general IV regression, in the case of
a single endogenous regressor $X$ one may use the following rule of thumb:


Compute the F-statistic which corresponds to the hypothesis that the
coefficients on $Z_1, \dots, Z_m$ are all zero in the first-stage regression. If
the F-statistic is less than 10, the instruments are weak such that the
TSLS estimate of the coefficient on $X$ is biased and no valid statistical
inference about its true value can be made. See also Appendix 12.5 of
the book.





The rule of thumb of Key Concept 12.5 is easily implemented in R. Run the
first-stage regression using lm() and subsequently compute the
heteroskedasticity-robust F-statistic by means of linearHypothesis(). This is
part of the application to the demand for cigarettes discussed in Chapter 12.4.
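
As a hedged illustration of the rule of thumb, the following sketch compares a strong and a weak instrument on simulated data. The variable names and coefficients are assumptions, and the robust covariance matrix is passed to linearHypothesis() explicitly as a matrix via its vcov. argument.

```r
library(AER)  # attaches car, lmtest and sandwich

# simulated data (illustrative assumptions)
set.seed(99)
n <- 200
z_strong <- rnorm(n)
z_weak <- rnorm(n)
u <- rnorm(n)
x_strong <- 1.00 * z_strong + u + rnorm(n)  # instrument matters a lot
x_weak <- 0.02 * z_weak + u + rnorm(n)      # instrument barely matters

# first-stage regressions
fs_strong <- lm(x_strong ~ z_strong)
fs_weak <- lm(x_weak ~ z_weak)

# heteroskedasticity-robust F-statistics for H0: coefficient on the instrument is zero
F_strong <- linearHypothesis(fs_strong, "z_strong = 0",
                             vcov. = vcovHC(fs_strong, type = "HC1"))$F[2]
F_weak <- linearHypothesis(fs_weak, "z_weak = 0",
                           vcov. = vcovHC(fs_weak, type = "HC1"))$F[2]

c(strong = F_strong, weak = F_weak)
```

With these settings the F-statistic for the strong instrument should lie far above the threshold of 10, while the one for the weak instrument is expected to be small.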


If Instruments are Weak 


There are two ways to proceed if instruments are weak: 


1. Discard the weak instruments and/or find stronger instruments. While 
the former is only an option if the unknown coefficients remain identified 
when the weak instruments are discarded, the latter can be very difficult 
and even may require a redesign of the whole study. 


2. Stick with the weak instruments but use methods that improve upon TSLS 
in this scenario, for example limited information maximum likelihood es- 
timation, see Appendix 12.5 of the book. 


When the Assumption of Instrument Exogeneity is Violated 


If there is correlation between an instrument and the error term, IV regression is
not consistent (this is shown in Appendix 12.4 of the book). The overidentifying
restrictions test (also called the J-test) is an approach to test the hypothesis that
additional instruments are exogenous. For the J-test to be applicable there need
to be more instruments than endogenous regressors. The J-test is summarized
in Key Concept 12.6.


Key Concept 12.6
J-Statistic / Overidentifying Restrictions Test


Take $\widehat{u}_i^{TSLS}$, $i = 1, \dots, n$, the residuals of the TSLS
estimation of the general IV regression model (12.5). Run the OLS regression

$$\widehat{u}_i^{TSLS} = \delta_0 + \delta_1 Z_{1i} + \dots + \delta_m Z_{mi} + \delta_{m+1} W_{1i} + \dots + \delta_{m+r} W_{ri} + e_i \tag{12.9}$$

and test the joint hypothesis

$$H_0: \delta_1 = \dots = \delta_m = 0$$

which states that all instruments are exogenous. This can be done using
the corresponding F-statistic by computing

$$J = mF.$$

This test is the overidentifying restrictions test and the statistic is called
the J-statistic with

$$J \sim \chi^2_{m-k}$$

in large samples under the null and the assumption of homoskedasticity.
The degrees of freedom $m - k$ state the degree of overidentification since
this is the number of instruments $m$ minus the number of endogenous
regressors $k$.





It is important to note that the J-statistic discussed in Key Concept 12.6 is
only $\chi^2_{m-k}$ distributed when the error term $e_i$ in the regression
(12.9) is homoskedastic. A discussion of the heteroskedasticity-robust
J-statistic is beyond the scope of this chapter. We refer to Section 18.7 of the
book for a theoretical argument.


As for the procedure shown in Key Concept 12.6, the application in the next 
section shows how to apply the J-test using linearHypothesis(). 
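
The steps of Key Concept 12.6 can also be traced by hand in base R. The following is a minimal sketch on simulated data with two valid instruments and one endogenous regressor. All names and parameter values are assumptions, and the F-statistic is the homoskedasticity-only one obtained from anova().

```r
# simulated data: m = 2 instruments, k = 1 endogenous regressor, one exogenous regressor
set.seed(7)
n <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
w <- rnorm(n)
u <- rnorm(n)
x <- z1 + z2 + 0.5 * w + 0.8 * u + rnorm(n)
y <- 1 + 2 * x + w + u

# TSLS coefficients via the two stages
x_hat <- fitted(lm(x ~ z1 + z2 + w))
b <- unname(coef(lm(y ~ x_hat + w)))

# TSLS residuals are computed with the observed x, not with x_hat
u_hat <- y - b[1] - b[2] * x - b[3] * w

# auxiliary regression (12.9) and the restricted version without the instruments
aux_unrestricted <- lm(u_hat ~ z1 + z2 + w)
aux_restricted <- lm(u_hat ~ w)
F_stat <- anova(aux_restricted, aux_unrestricted)$F[2]

# J-statistic and p-value with m - k degrees of freedom
m <- 2
k <- 1
J <- m * F_stat
p_value <- pchisq(J, df = m - k, lower.tail = FALSE)

c(J = J, p_value = p_value)
```

Since both simulated instruments are exogenous by construction, the test should fail to reject in most draws. The p-value is computed with m - k = 1 degree of freedom rather than with m = 2.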


12.4 Application to the Demand for Cigarettes 


Are the general sales tax and the cigarette-specific tax valid instruments? If not, 
TSLS is not helpful to estimate the demand elasticity for cigarettes discussed 
in Chapter 12.2. As discussed in Chapter 12.1, both variables are likely to be
relevant but whether they are exogenous is a different question. 


The book argues that cigarette-specific taxes could be endogenous because there
might be state-specific historical factors like the economic importance of the
tobacco farming and cigarette production industry that lobby for low
cigarette-specific taxes. Since it is plausible that tobacco growing states have
higher rates of smoking than others, this would lead to endogeneity of
cigarette-specific taxes. If we had data on the size of the tobacco and cigarette
industry, we could solve this potential issue by including the information in the
regression. Unfortunately, this is not the case.


However, since the role of the tobacco and cigarette industry is a factor that 
can be assumed to differ across states but not over time we may exploit the 
panel structure of CigarettesSW instead: as shown in Chapter 10.2, regression 
using data on changes between two time periods eliminates such state specific 
and time invariant effects. Following the book we consider changes in variables 
between 1985 and 1995. That is, we are interested in estimating the long-run 
elasticity of the demand for cigarettes. 
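
Why differencing removes state-specific, time-invariant effects can be seen in a short simulation. This is a sketch under stated assumptions: a two-period panel with an unobserved effect a that is correlated with the regressor. Names and coefficients are purely illustrative.

```r
# simulated two-period panel (all names and values are illustrative assumptions)
set.seed(2024)
n <- 5000
a <- rnorm(n)                 # time-invariant unit effect, unobserved
x1 <- a + rnorm(n)            # regressor in period 1, correlated with a
x2 <- a + rnorm(n)            # regressor in period 2, correlated with a
y1 <- 1.5 * x1 + 2 * a + rnorm(n)
y2 <- 1.5 * x2 + 2 * a + rnorm(n)

# cross-section regression in one period: biased by the omitted effect a
slope_level <- unname(coef(lm(y2 ~ x2))[2])

# regression in differences: a drops out of both sides
slope_diff <- unname(coef(lm(I(y2 - y1) ~ I(x2 - x1)))[2])

c(levels = slope_level, differences = slope_diff)
```

The regression in levels overstates the true slope of 1.5 in this setup, while the regression in differences should recover it closely.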


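The model to be estimated by TSLS using the general sales tax and the
cigarette-specific sales tax as instruments hence is

$$\begin{aligned}
\log(Q_{i,1995}^{cigarettes}) - \log(Q_{i,1985}^{cigarettes}) = {} & \beta_0 + \beta_1 \left[\log(P_{i,1995}^{cigarettes}) - \log(P_{i,1985}^{cigarettes})\right] \\
& + \beta_2 \left[\log(income_{i,1995}) - \log(income_{i,1985})\right] + u_i.
\end{aligned} \tag{12.10}$$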


We first create differences from 1985 to 1995 for the dependent variable, the 
regressors and both instruments. 


# subset data for year 1985
c1985 <- subset(CigarettesSW, year == "1985")

# define differences in variables
packsdiff <- log(c1995$packs) - log(c1985$packs)

pricediff <- log(c1995$price/c1995$cpi) - log(c1985$price/c1985$cpi)

incomediff <- log(c1995$income/c1995$population/c1995$cpi) -
  log(c1985$income/c1985$population/c1985$cpi)

salestaxdiff <- (c1995$taxs - c1995$tax)/c1995$cpi -
  (c1985$taxs - c1985$tax)/c1985$cpi

cigtaxdiff <- c1995$tax/c1995$cpi - c1985$tax/c1985$cpi


We now perform three different IV estimations of (12.10) using ivreg(): 

1. TSLS using only the difference in the sales taxes between 1985 and 1995
as the instrument.


2. TSLS using only the difference in the cigarette-specific sales taxes between
1985 and 1995 as the instrument.


3. TSLS using both the difference in the sales taxes between 1985 and 1995 and
the difference in the cigarette-specific sales taxes between 1985 and 1995 as
instruments.


# estimate the three models
cig_ivreg_diff1 <- ivreg(packsdiff ~ pricediff + incomediff | incomediff +
                         salestaxdiff)

cig_ivreg_diff2 <- ivreg(packsdiff ~ pricediff + incomediff | incomediff +
                         cigtaxdiff)

cig_ivreg_diff3 <- ivreg(packsdiff ~ pricediff + incomediff | incomediff +
                         salestaxdiff + cigtaxdiff)


As usual we use coeftest() in conjunction with vcovHC() to obtain robust 
coefficient summaries for all models. 


# robust coefficient summary for 1.
coeftest(cig_ivreg_diff1, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) -0.117962   0.068217 -1.7292   0.09062 .  
#> pricediff   -0.938014   0.207502 -4.5205 4.454e-05 ***
#> incomediff   0.525970   0.339494  1.5493   0.12832    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# robust coefficient summary for 2.
coeftest(cig_ivreg_diff2, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) -0.017049   0.067217 -0.2536    0.8009    
#> pricediff   -1.342515   0.228661 -5.8712 4.848e-07 ***
#> incomediff   0.428146   0.298718  1.4333    0.1587    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# robust coefficient summary for 3.
coeftest(cig_ivreg_diff3, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) -0.052003   0.062488 -0.8322    0.4097    
#> pricediff   -1.202403   0.196943 -6.1053 2.178e-07 ***
#> incomediff   0.462030   0.309341  1.4936    0.1423    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We proceed by generating a tabulated summary of the estimation results using 
stargazer(). 


# load the package providing stargazer()
library(stargazer)

# gather robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(cig_ivreg_diff1, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg_diff2, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg_diff3, type = "HC1"))))

# generate table
stargazer(cig_ivreg_diff1, cig_ivreg_diff2, cig_ivreg_diff3,
          header = FALSE,
          type = "html",
          omit.table.layout = "n",
          digits = 3,
          column.labels = c("IV: salestax", "IV: cigtax", "IVs: salestax, cigtax"),
          dep.var.labels.include = FALSE,
          dep.var.caption = "Dependent Variable: 1985-1995 Difference in Log per Pack Price",
          se = rob_se)


Table 12.1: TSLS Estimates of the Long-Term Elasticity of the Demand for
Cigarettes using Panel Data

Dependent variable: 1985-1995 difference in log per pack price

                                IV: salestax   IV: cigtax   IVs: salestax, cigtax
                                    (1)           (2)               (3)
pricediff                        -0.938***     -1.343***         -1.202***
                                 (0.208)       (0.229)           (0.197)
incomediff                        0.526         0.428             0.462
                                 (0.339)       (0.299)           (0.309)
Constant                         -0.118*       -0.017            -0.052
                                 (0.068)       (0.067)           (0.062)

Observations                        48            48                48
R²                                0.550         0.520             0.547
Adjusted R²                       0.530         0.498             0.526
Residual Std. Error (df = 45)     0.091         0.094             0.091


Table 12.1 reports negative estimates of the coefficient on pricediff that are
quite different in magnitude. Which one should we trust? This hinges on the
validity of the instruments used. To assess this we compute F-statistics for the
first-stage regressions of all three models to check instrument relevance.


# first-stage regressions
mod_relevance1 <- lm(pricediff ~ salestaxdiff + incomediff)
mod_relevance2 <- lm(pricediff ~ cigtaxdiff + incomediff)
mod_relevance3 <- lm(pricediff ~ incomediff + salestaxdiff + cigtaxdiff)


# check instrument relevance for model (1)
linearHypothesis(mod_relevance1,
                 "salestaxdiff = 0",
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#> 
#> Hypothesis:
#> salestaxdiff = 0
#> 
#> Model 1: restricted model
#> Model 2: pricediff ~ salestaxdiff + incomediff
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F    Pr(>F)    
#> 1     46                        
#> 2     45  1 28.445 3.009e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# check instrument relevance for model (2)
linearHypothesis(mod_relevance2,
                 "cigtaxdiff = 0",
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#> 
#> Hypothesis:
#> cigtaxdiff = 0
#> 
#> Model 1: restricted model
#> Model 2: pricediff ~ cigtaxdiff + incomediff
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F   Pr(>F)    
#> 1     46                       
#> 2     45  1 98.034 7.09e-13 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# check instrument relevance for model (3)
linearHypothesis(mod_relevance3,
                 c("salestaxdiff = 0", "cigtaxdiff = 0"),
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#> 
#> Hypothesis:
#> salestaxdiff = 0
#> cigtaxdiff = 0
#> 
#> Model 1: restricted model
#> Model 2: pricediff ~ incomediff + salestaxdiff + cigtaxdiff
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F    Pr(>F)    
#> 1     46                        
#> 2     44  2 76.916 4.339e-15 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We also conduct the overidentifying restrictions test for model three which is 
the only model where the coefficient on the difference in log prices is overiden- 
tified (m = 2, k = 1) such that the J-statistic can be computed. To do this 
we take the residuals stored in cig_ivreg_diff3 and regress them on both in- 
struments and the presumably exogenous regressor incomediff. We again use 
linearHypothesis() to test whether the coefficients on both instruments are 
zero which is necessary for the exogeneity assumption to be fulfilled. Note that 
with test = "Chisq" we obtain a chi-squared distributed test statistic instead 
of an F-statistic. 

# compute the J-statistic
cig_iv_OR <- lm(residuals(cig_ivreg_diff3) ~ incomediff + salestaxdiff + cigtaxdiff)

cig_OR_test <- linearHypothesis(cig_iv_OR,
                                c("salestaxdiff = 0", "cigtaxdiff = 0"),
                                test = "Chisq")

cig_OR_test
#> Linear hypothesis test
#> 
#> Hypothesis:
#> salestaxdiff = 0
#> cigtaxdiff = 0
#> 
#> Model 1: restricted model
#> Model 2: residuals(cig_ivreg_diff3) ~ incomediff + salestaxdiff + cigtaxdiff
#> 
#>   Res.Df     RSS Df Sum of Sq Chisq Pr(>Chisq)  
#> 1     46 0.37472                                
#> 2     44 0.33695  2  0.037769 4.932    0.08492 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Caution: In this case the p-value reported by linearHypothesis() is wrong
because the degrees of freedom are set to 2. This differs from the degree of
overidentification ($m - k = 2 - 1 = 1$) so the J-statistic is $\chi^2_1$
distributed instead of following a $\chi^2_2$ distribution as assumed by default
by linearHypothesis(). We may compute the correct p-value using pchisq().


# compute correct p-value for J-statistic
pchisq(cig_OR_test[2, 5], df = 1, lower.tail = FALSE)
#> [1] 0.02636406
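
The effect of the degrees of freedom can be reproduced from the rounded test statistic alone. The sketch below takes the value 4.932 from the output above and compares both choices. The object names are hypothetical.

```r
# rounded J-statistic as reported in the output above
J <- 4.932

# p-value under the (incorrect) default of 2 degrees of freedom
p_df2 <- pchisq(J, df = 2, lower.tail = FALSE)

# p-value under the correct m - k = 1 degree of freedom
p_df1 <- pchisq(J, df = 1, lower.tail = FALSE)

round(c(df_2 = p_df2, df_1 = p_df1), 4)
```

With two degrees of freedom the p-value is about 0.085, with one degree of freedom about 0.026, so the test decision at the 5% level depends on using the correct distribution.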


Since this value is smaller than 0.05 we reject the hypothesis that both instru- 
ments are exogenous at the level of 5%. This means one of the following: 


1. The sales tax is an invalid instrument for the per-pack price. 

2. The cigarette-specific sales tax is an invalid instrument for the per-pack
price.

3. Both instruments are invalid. 


The book argues that the assumption of instrument exogeneity is more likely 
to hold for the general sales tax (see Chapter 12.4 of the book) such that the 
IV estimate of the long-run elasticity of demand for cigarettes we consider the 
most trustworthy is —0.94, the TSLS estimate obtained using the general sales 
tax as the only instrument. 

The interpretation of this estimate is that over a 10-year period, an increase in
the average price per package by one percent is expected to decrease
consumption by about 0.94 percent. This suggests that, in the long run, price
increases can reduce cigarette consumption considerably.
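
For larger price changes the percentage interpretation is only an approximation. As a small worked example, assuming a hypothetical price increase of 10% and the elasticity of -0.94 from the text, the approximate and the exact change implied by a log-log model can be compared as follows.

```r
# long-run elasticity taken from the text
elasticity <- -0.94

# hypothetical price increase of 10% (an assumption for illustration)
price_increase <- 0.10

# approximation: elasticity times the relative price change
approx_change <- elasticity * price_increase

# exact change implied by the log-log specification
exact_change <- (1 + price_increase)^elasticity - 1

round(100 * c(approximate = approx_change, exact = exact_change), 2)
```

The exact decrease is somewhat smaller in absolute value than the approximate one, and the gap grows with the size of the price change.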


12.5 Where Do Valid Instruments Come From? 


Chapter 12.5 of the book presents a comprehensive discussion of approaches to 
find valid instruments in practice by the example of three research questions: 


• Does putting criminals in jail reduce crime?
• Does cutting class sizes increase test scores?
• Does aggressive treatment of heart attacks prolong lives?


This section is not directly related to applications in R which is why we do not 
discuss the contents here. We encourage you to work through this on your own. 


Summary 


ivreg() from the package AER provides convenient functionalities to estimate 
IV regression models in R. It is an implementation of the TSLS estimation 
approach. 


Besides treating IV estimation, we have also discussed how to test for weak 
instruments and how to conduct an overidentifying restrictions test when there 
are more instruments than endogenous regressors using R. 


An empirical application has shown how ivreg() can be used to estimate the 
long-run elasticity of demand for cigarettes based on CigarettesSW, a panel data
set on cigarette consumption and economic indicators for all 48 continental U.S. 
states for 1985 and 1995. Different sets of instruments were used and it has been 
argued why using the general sales tax as the only instrument is the preferred 
choice. The estimate of the demand elasticity deemed the most trustworthy 
is —0.94. This estimate suggests that there is a remarkable negative long-run 
effect on cigarette consumption of increasing prices. 
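To make the TSLS idea summarized above concrete, the following sketch carries out the two stages by hand with lm() on simulated data. The data-generating process, the coefficient values and all variable names are hypothetical and are not taken from the cigarette application.

```r
set.seed(1)
n <- 1000

# simulated data (illustrative only): z is the instrument,
# v is an unobserved factor that makes x endogenous
z <- rnorm(n)
v <- rnorm(n)
x <- 0.8 * z + v + rnorm(n)
y <- 1 - 0.9 * x + 2 * v + rnorm(n)

# OLS is inconsistent because x is correlated with the error (through v)
ols <- unname(coef(lm(y ~ x))["x"])

# TSLS: first stage, then regress y on the first-stage fitted values
x_hat <- fitted(lm(x ~ z))
tsls <- unname(coef(lm(y ~ x_hat))["x_hat"])

# with a single instrument, TSLS equals a ratio of sample covariances
iv_ratio <- cov(z, y) / cov(z, x)

c(ols = ols, tsls = tsls, iv_ratio = iv_ratio)
```

The standard errors reported by the second-stage lm() call are not valid for inference, which is one reason to use ivreg() in practice.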


12.6 Exercises 


Chapter 13 


Experiments and 
Quasi-Experiments 


This chapter discusses statistical tools that are commonly applied in program 
evaluation, where interest lies in measuring the causal effects of programs, poli- 
cies or other interventions. An optimal research design for this purpose is what 
statisticians call an ideal randomized controlled experiment. The basic idea is 
to randomly assign subjects to two different groups, one that receives the treat- 
ment (the treatment group) and one that does not (the control group) and to 
compare outcomes for both groups in order to get an estimate of the average 
treatment effect. 


Such experimental data is fundamentally different from observational data. For 
example, one might use a randomized controlled experiment to measure how 
much the performance of students in a standardized test differs between two 
classes where one has a “regular” student-teacher ratio and the other one has
fewer students. The data produced by such an experiment would be different 
from, e.g., the observed cross-section data on the students’ performance used 
throughout Chapters 4 to 8 where class sizes are not randomly assigned to 
students but instead are the results of an economic decision where educational 
objectives and budgetary aspects were balanced. 


For economists, randomized controlled experiments are often difficult or even
infeasible to implement. For example, due to ethical, moral and legal reasons
it is practically impossible for a business owner to estimate the causal effect on
the productivity of workers of putting them under psychological stress using an
experiment where workers are randomly assigned either to the treatment group
that is under time pressure or to the control group where work is under regular
conditions, ideally without knowledge of being in an experiment (see the box
The Hawthorne Effect on p. 528 of the book).


However, sometimes external circumstances produce what is called a quasi-experiment
or natural experiment. This “as if” randomness allows for estimation
of causal effects that are of interest for economists using tools which are very 
similar to those valid for ideal randomized controlled experiments. These tools 
draw heavily on the theory of multiple regression and also on IV regression (see 
Chapter 12). We will review the core aspects of these methods and demonstrate 
how to apply them in R using the STAR data set (see the description of the 
data set). 


The following packages and their dependencies are needed for reproduction of 
the code chunks presented throughout this chapter: 


• AER (Kleiber and Zeileis, 2020)
• dplyr (Wickham et al., 2020)
• MASS (Ripley, 2020)
• mvtnorm (Genz et al., 2020)
• rddtools (Stigler and Quast, 2020)
• scales (Wickham and Seidel, 2020)
• stargazer (Hlavac, 2018)
• tidyr (Wickham and Henry, 2020)


Make sure the following code chunk runs without any errors. 


library(AER)
library(dplyr)
library(MASS)
library(mvtnorm)
library(rddtools)
library(scales)
library(stargazer)
library(tidyr)


13.1 Potential Outcomes, Causal Effects and 
Idealized Experiments 

We now briefly recap the idea of the average causal effect and how it can be estimated
using the differences estimator. We advise you to work through Chapter
13.1 of the book for a better understanding.


Potential Outcomes and the average causal effect 


A potential outcome is the outcome for an individual under a potential treat- 
ment. For this individual, the causal effect of the treatment is the difference 
between the potential outcome if the individual receives the treatment and the 
potential outcome if she does not. Since this causal effect may be different for
different individuals and it is not possible to measure the causal effect for a 
single individual, one is interested in studying the average causal effect of the 
treatment, hence also called the average treatment effect. 


In an ideal randomized controlled experiment the following conditions are ful- 
filled: 


1. The subjects are selected at random from the population. 
2. The subjects are randomly assigned to treatment and control group. 


Condition 1 guarantees that the subjects’ potential outcomes are drawn ran- 
domly from the same distribution such that the expected value of the causal 
effect in the sample is equal to the average causal effect in the distribution. Con- 
dition 2 ensures that the receipt of treatment is independent from the subjects’ 
potential outcomes. If both conditions are fulfilled, the expected causal effect 
is the expected outcome in the treatment group minus the expected outcome in 
the control group. Using conditional expectations we have 


$$\text{Average causal effect} = E(Y_i \vert X_i = 1) - E(Y_i \vert X_i = 0),$$


where $X_i$ is a binary treatment indicator.


The average causal effect can be estimated using the differences estimator, which 
is nothing but the OLS estimator in the simple regression model 


$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1,\dots,n, \tag{13.1}$$


where random assignment ensures that $E(u_i \vert X_i) = 0$.
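The following sketch illustrates that the differences estimator is simply the difference between the two group means. The data are simulated, and the true treatment effect of 2 as well as the variable names are assumptions made for this example.

```r
set.seed(123)
n <- 500

# random assignment to treatment (1) or control (0)
X <- rbinom(n, size = 1, prob = 0.5)

# simulated outcome with a true average treatment effect of 2
Y <- 10 + 2 * X + rnorm(n)

# differences estimator: OLS slope in the regression of Y on X
beta1_ols <- unname(coef(lm(Y ~ X))["X"])

# difference in sample means between treatment and control group
mean_diff <- mean(Y[X == 1]) - mean(Y[X == 0])

c(ols = beta1_ols, mean_diff = mean_diff)
```

Both numbers coincide up to floating point error because a regression on a single binary regressor fits the two group means exactly.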


The OLS estimator in the regression model

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + u_i, \quad i = 1,\dots,n \tag{13.2}$$

with additional regressors $W_1,\dots,W_r$ is called the differences estimator with
additional regressors. It is assumed that treatment $X_i$ is randomly assigned
so that it is independent of the pretreatment characteristic $W_i$. This
assumption is called conditional mean independence and implies

$$E(u_i \vert X_i, W_i) = E(u_i \vert W_i),$$

so the conditional expectation of the error $u_i$ given the treatment indicator $X_i$
and the pretreatment characteristic $W_i$ does not depend on $X_i$. Conditional
mean independence replaces the first least squares assumption in Key Concept
6.4 and thus ensures that the differences estimator of $\beta_1$ is unbiased. The differences
estimator with additional regressors is more efficient than the differences
estimator if the additional regressors explain some of the variation in the $Y_i$.
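The efficiency gain from additional regressors can be illustrated with a small simulation. This is a sketch under stated assumptions: the pretreatment characteristic W, the coefficients and the sample size are hypothetical.

```r
set.seed(42)
n <- 500

X <- rbinom(n, 1, 0.5)   # randomly assigned treatment
W <- rnorm(n)            # pretreatment characteristic, independent of X
Y <- 1 + 2 * X + 3 * W + rnorm(n)

fit_simple <- lm(Y ~ X)
fit_controls <- lm(Y ~ X + W)

# standard errors of the treatment coefficient in both regressions
se_simple <- summary(fit_simple)$coefficients["X", "Std. Error"]
se_controls <- summary(fit_controls)$coefficients["X", "Std. Error"]

c(se_simple = se_simple, se_controls = se_controls)
```

Because W explains a large share of the variation in Y, the residual variance falls and the treatment effect is estimated more precisely in the second regression.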


13.2 Threats to Validity of Experiments


The concepts of internal and external validity discussed in Key Concept 9.1 are 
also applicable for studies based on experimental and quasi-experimental data. 
Chapter 13.2 of the book provides a thorough explanation of the particular 
threats to internal and external validity of experiments including examples. We 
limit ourselves to a short repetition of the threats listed there. Consult the book 
for a more detailed explanation. 


Threats to Internal Validity 


1. Failure to Randomize


If the subjects are not randomly assigned to the treatment group, then the
outcomes will be contaminated with the effect of the subjects’ individual
characteristics or preferences and it is not possible to obtain an unbiased
estimate of the treatment effect. One can test for nonrandom assignment
using a significance test (F-test) on the coefficients in the regression model

$$X_i = \beta_0 + \beta_1 W_{1i} + \dots + \beta_r W_{ri} + u_i, \quad i = 1,\dots,n.$$


2. Failure to Follow the Treatment Protocol


If subjects do not follow the treatment protocol, i.e., some subjects in the
treatment group manage to avoid receiving the treatment and/or some
subjects in the control group manage to receive the treatment (partial
compliance), there is correlation between $X_i$ and $u_i$ such that the OLS
estimator of the average treatment effect will be biased. If there are data
on both treatment actually received ($X_i$) and initial random assignment
($Z_i$), IV regression of the models (13.1) and (13.2) is a remedy.


3. Attrition


Attrition may result in a nonrandomly selected sample. If subjects systematically
drop out of the study after being assigned to the control or
the treatment group (systematic means that the reason for the dropout is
related to the treatment) there will be correlation between $X_i$ and $u_i$ and
hence bias in the OLS estimator of the treatment effect.


4. Experimental Effects


If human subjects in treatment group and/or control group know that
they are in an experiment, they might adapt their behaviour in a way
that prevents unbiased estimation of the treatment effect.


5. Small Sample Sizes


As we know from the theory of linear regression, small sample sizes lead to
imprecise estimation of the coefficients and thus imply imprecise estimation
of the causal effect. Furthermore, confidence intervals and hypothesis
tests may produce wrong inference when the sample size is small.
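The test for nonrandom assignment mentioned under the first threat can be sketched as follows. The data are simulated with a truly random assignment, the two pretreatment characteristics W1 and W2 are hypothetical, and the F-statistic shown is the homoskedasticity-only version reported by summary().

```r
set.seed(7)
n <- 400

# two pretreatment characteristics
W1 <- rnorm(n)
W2 <- rnorm(n)

# treatment assigned at random, i.e., unrelated to W1 and W2
X <- rbinom(n, 1, 0.5)

# regress the treatment indicator on the pretreatment characteristics
assign_fit <- lm(X ~ W1 + W2)

# overall F-statistic for H0: all coefficients on the W's are zero
fstat <- summary(assign_fit)$fstatistic
p_value <- pf(fstat["value"], fstat["numdf"], fstat["dendf"],
              lower.tail = FALSE)

unname(p_value)
```

A small p-value would be evidence that the pretreatment characteristics help predict treatment status, which is at odds with random assignment.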


Threats to External Validity


1. Nonrepresentative Sample 


If the population studied and the population of interest are not sufficiently 
similar, there is no justification in generalizing the results. 


2. Nonrepresentative Program or Policy 


If the program or policy for the population studied differs considerably
from the program (to be) applied to population(s) of interest, the results
cannot be generalized. For example, a small-scale program with low
funding might have different effects than a widely available scaled-up program
that is actually implemented. There are other factors like duration
and the extent of monitoring that should be considered here.


3. General Equilibrium Effects 


If market and/or environmental conditions cannot be kept constant when 
an internally valid program is implemented broadly, external validity may 
be doubtful. 


13.3 Experimental Estimates of the Effect of 
Class Size Reductions 


Experimental Design and the Data Set 


The Project Student-Teacher Achievement Ratio (STAR) was a large randomized
controlled experiment with the aim of assessing whether a class size reduction
is effective in improving education outcomes. It was conducted in 80
Tennessee elementary schools over a period of four years during the 1980s by
the State Department of Education.


In the first year, about 6400 students were randomly assigned into one of three 
interventions: small class (13 to 17 students per teacher), regular class (22 to 
25 students per teacher), and regular-with-aide class (22 to 25 students with a 
full-time teacher’s aide). Teachers were also randomly assigned to the classes 
they taught. The interventions were initiated as the students entered school 
in kindergarten and continued through to third grade. Control and treatment 
groups across grades are summarized in Table 13.1. 


Table 13.1: Control and treatment groups in the STAR experiment

              K                     1                     2                     3
Treatment 1   Small class           Small class           Small class           Small class
Treatment 2   Regular class + aide  Regular class + aide  Regular class + aide  Regular class + aide
Control       Regular class         Regular class         Regular class         Regular class


Each year, the students’ learning progress was assessed using the sum of the 
points scored on the math and reading parts of a standardized test (the Stanford 
Achievement Test). 


The STAR data set is part of the package AER. 


# load the package AER and the STAR dataset
library(AER)
data(STAR)


head(STAR) shows that there is a variety of factor variables that describe student
and teacher characteristics as well as various school indicators, all of which are 
separately recorded for the four different grades. The data is in wide format. 
That is, each variable has its own column and for each student, the rows contain 
observations on these variables. Using dim(STAR) we find that there are a total 
of 11598 observations on 47 variables. 


# get an overview 
head(STAR, 2) 


#>      gender ethnicity   birth stark star1 star2   star3 readk read1 read2 read3
#> 1122 female      afam 1979 Q3  <NA>  <NA>  <NA> regular    NA    NA    NA   580
#> 1137 female      cauc 1980 Q1 small small small   small   447   507   568   587
#>      mathk math1 math2 math3   lunchk lunch1   lunch2 lunch3 schoolk school1
#> 1122    NA    NA    NA   564     <NA>   <NA>     <NA>   free    <NA>    <NA>
#> 1137   473   538   579   593 non-free   free non-free   free   rural   rural
#>      school2  school3  degreek  degree1  degree2  degree3 ladderk ladder1
#> 1122    <NA> suburban     <NA>     <NA>     <NA> bachelor    <NA>    <NA>
#> 1137   rural    rural bachelor bachelor bachelor bachelor  level1  level1
#>         ladder2    ladder3 experiencek experience1 experience2 experience3
#> 1122       <NA>     level1          NA          NA          NA          30
#> 1137 apprentice apprentice           7           7           3           1
#>      tethnicityk tethnicity1 tethnicity2 tethnicity3 systemk system1 system2
#> 1122        <NA>        <NA>        <NA>        cauc    <NA>    <NA>    <NA>
#> 1137        cauc        cauc        cauc        cauc      30      30      30
#>      system3 schoolidk schoolid1 schoolid2 schoolid3
#> 1122      22      <NA>      <NA>      <NA>        54
#> 1137      30        63        63        63        63

dim(STAR)
#> [1] 11598    47

# get variable names
names(STAR)
#>  [1] "gender"      "ethnicity"   "birth"       "stark"       "star1"
#>  [6] "star2"       "star3"       "readk"       "read1"       "read2"
#> [11] "read3"       "mathk"       "math1"       "math2"       "math3"
#> [16] "lunchk"      "lunch1"      "lunch2"      "lunch3"      "schoolk"
#> [21] "school1"     "school2"     "school3"     "degreek"     "degree1"
#> [26] "degree2"     "degree3"     "ladderk"     "ladder1"     "ladder2"
#> [31] "ladder3"     "experiencek" "experience1" "experience2" "experience3"
#> [36] "tethnicityk" "tethnicity1" "tethnicity2" "tethnicity3" "systemk"
#> [41] "system1"     "system2"     "system3"     "schoolidk"   "schoolid1"
#> [46] "schoolid2"   "schoolid3"


A majority of the variable names contain a suffix (k, 1, 2 or 3) stating the grade
which the respective variable is referring to. This facilitates regression analysis
because it allows us to adjust the formula argument in lm() for each grade by
simply changing the variables’ suffixes accordingly.
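As a small illustration of this naming scheme, the formulas for all four grades can be generated programmatically. This is only a sketch: the helper objects grades and formulas are hypothetical names and are not used elsewhere in this chapter.

```r
# grade suffixes used in the variable names
grades <- c("k", "1", "2", "3")

# build one formula per grade by pasting the suffix onto the variable stems
formulas <- lapply(grades, function(g) {
  as.formula(paste0("I(read", g, " + math", g, ") ~ star", g))
})
names(formulas) <- grades

# inspect the formula for grade 2
formulas[["2"]]
```

With the STAR data loaded, a call such as lapply(formulas, lm, data = STAR) should then estimate one model per grade.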


The outcome produced by head() shows that some values recorded are NA and 
<NA>, i.e., there is no data on this variable for the student under consideration. 
This lies in the nature of the data: for example, take the first observation 
STAR[1,]. 


In the output of head(STAR, 2) we find that the student entered the experiment
in third grade in a regular class, which is why the class size is recorded in star3
and the other class type indicator variables are <NA>. For the same reason, her
math and reading scores are recorded for the third grade only. It is
straightforward to only get her non-NA/<NA> recordings: simply drop the NAs
using !is.na().


# drop NA recordings for the first observation and print to the console 
STAR[1, !is.na(STAR[1, ])] 


#>      gender ethnicity   birth   star3 read3 math3 lunch3  school3  degree3
#> 1122 female      afam 1979 Q3 regular   580   564   free suburban bachelor
#>      ladder3 experience3 tethnicity3 system3 schoolid3
#> 1122  level1          30        cauc      22        54


is.na(STAR[1, ]) returns a logical matrix with a single row that is TRUE at positions
corresponding to missing entries (NA or <NA>) of the first observation. The ! operator
inverts the result such that we obtain only the non-missing entries of the first observation.
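The same mechanics can be tried on a tiny data frame without loading the STAR data. The object toy and its values are made up for illustration.

```r
# a toy data frame in wide format with missing entries (illustrative values)
toy <- data.frame(readk = c(NA, 447),
                  read3 = c(580, 587),
                  stark = factor(c(NA, "small")),
                  star3 = factor(c("regular", "small")))

# logical matrix flagging missing entries in the first row
miss <- is.na(toy[1, ])

# keep only the columns that are observed for the first row
observed <- toy[1, !miss]
names(observed)
```

Only read3 and star3 remain because the first row has no kindergarten records.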


In general it is not necessary to remove rows with missing data because lm()
does so by default. Missing data may imply a small sample size and thus may
lead to imprecise estimation and wrong inference. This is, however, not an issue
for the study at hand since, as we will see below, sample sizes lie beyond 5000
observations for each regression conducted.
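The default treatment of missing values by lm() can be checked with a minimal example. The data frame d is made up and assumes that the na.action option is left at its default, na.omit.

```r
# toy data with missing values in the outcome and in the regressor
d <- data.frame(y = c(1.2, NA, 3.1, 4.8, 5.1, 6.3),
                x = c(1, 2, NA, 4, 5, 6))

fit <- lm(y ~ x, data = d)

# number of rows in the data and number of observations actually used
c(rows = nrow(d), used = nobs(fit))
```

Rows with a missing value in any variable of the formula are dropped, so only four of the six rows enter the regression.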


Analysis of the STAR Data


As can be seen from Table 13.1 there are two treatment groups in each grade, 
small classes with only 13 to 17 students and regular classes with 22 to 25 stu- 
dents and a teaching aide. Thus, two binary variables, each being an indicator 
for the respective treatment group, are introduced for the differences estimator 
to capture the treatment effect for each treatment group separately. This yields 
the population regression model 


$$Y_i = \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + u_i, \tag{13.3}$$


with test score $Y_i$, the small class indicator $SmallClass_i$ and $RegAide_i$, the
indicator for a regular class with aide.
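In R the two indicators need not be constructed by hand: a factor regressor is expanded into one binary variable per non-reference level. The sketch below shows this with model.matrix() on a made-up factor named star. The level names mimic those of the class type variables in STAR.

```r
# class type as a factor with 'regular' as the reference level (toy data)
star <- factor(c("regular", "small", "regular+aide", "small", "regular"),
               levels = c("regular", "small", "regular+aide"))

# design matrix that lm() would construct for the formula ~ star
X <- model.matrix(~ star)
colnames(X)
```

The column names combine the variable name and the level, which explains coefficient labels such as starksmall in the regression output.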


We reproduce the results presented in Table 13.1 of the book by performing 
the regression (13.3) for each grade separately. For each student, the dependent 
variable is simply the sum of the points scored in the math and reading parts, 
constructed using I(). 


# compute differences estimates for each grade
fmk <- lm(I(readk + mathk) ~ stark, data = STAR)
fm1 <- lm(I(read1 + math1) ~ star1, data = STAR)
fm2 <- lm(I(read2 + math2) ~ star2, data = STAR)
fm3 <- lm(I(read3 + math3) ~ star3, data = STAR)


# obtain coefficient matrix using robust standard errors
coeftest(fmk, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>                    Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       918.04289    1.63339 562.0473 < 2.2e-16 ***
#> starksmall         13.89899    2.45409   5.6636 1.554e-08 ***
#> starkregular+aide   0.31394    2.27098   0.1382    0.8901    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(fm1, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>                    Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       1039.3926     1.7846 582.4321 < 2.2e-16 ***
#> star1small          29.7808     2.8311  10.5190 < 2.2e-16 ***
#> star1regular+aide   11.9587     2.6520   4.5093  6.62e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(fm2, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>                    Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       1157.8066     1.8151 637.8820 < 2.2e-16 ***
#> star2small          19.3944     2.7117   7.1522  9.55e-13 ***
#> star2regular+aide    3.4791     2.5447   1.3672    0.1716    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(fm3, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>                     Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       1228.50636    1.68001 731.2483 < 2.2e-16 ***
#> star3small          15.58660    2.39604   6.5051 8.393e-11 ***
#> star3regular+aide   -0.29094    2.27271  -0.1280    0.8981    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
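To see what the robust standard errors above are made of, the HC1 covariance matrix can be computed by hand from the usual sandwich formula. This is a minimal sketch on simulated heteroskedastic data using only base R. It assumes the standard HC1 definition, i.e., the HC0 sandwich scaled by n / (n - k), and all object names are illustrative.

```r
set.seed(10)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))   # heteroskedastic errors

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)

# sandwich formula (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}, scaled by n / (n - k)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)    # each row of X is multiplied by its residual
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

se_hc1 <- sqrt(diag(vcov_hc1))
se_hc1
```

With the sandwich package loaded, the result should agree with sqrt(diag(vcovHC(fit, type = "HC1"))).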


We gather the results and present them in a table using stargazer(). 


# compute robust standard errors for each model and gather them in a list
rob_se_1 <- list(sqrt(diag(vcovHC(fmk, type = "HC1"))),
                 sqrt(diag(vcovHC(fm1, type = "HC1"))),
                 sqrt(diag(vcovHC(fm2, type = "HC1"))),
                 sqrt(diag(vcovHC(fm3, type = "HC1"))))

library(stargazer)

stargazer(fmk, fm1, fm2, fm3,
          title = "Project STAR: Differences Estimates",
          header = FALSE,
          type = "latex",
          model.numbers = FALSE,
          omit.table.layout = "n",
          digits = 3,
          column.labels = c("K", "1", "2", "3"),
          dep.var.caption = "Dependent Variable: Grade",
          dep.var.labels.include = FALSE,
          se = rob_se_1)


The estimates presented in Table 13.2 suggest that the class size reduction improves
student performance. The estimates of the coefficient on SmallClass are
statistically significant at the 1% level in all grades and, except for grade 1,
roughly of the same magnitude (they lie between 13.90 and 19.39 points). Furthermore,
a teaching aide has little, possibly zero, effect on the performance of the
students.


Table 13.2: Project STAR - Differences Estimates

                     Dependent Variable: Grade

                     K                          1                          2                          3
starksmall           13.899***
                     (2.454)
starkregular+aide    0.314
                     (2.271)
star1small                                      29.781***
                                                (2.831)
star1regular+aide                               11.959***
                                                (2.652)
star2small                                                                 19.394***
                                                                           (2.712)
star2regular+aide                                                          3.479
                                                                           (2.545)
star3small                                                                                            15.587***
                                                                                                      (2.396)
star3regular+aide                                                                                     -0.291
                                                                                                      (2.273)
Constant             918.043***                 1,039.393***               1,157.807***               1,228.506***
                     (1.633)                    (1.785)                    (1.815)                    (1.680)
Observations         5,786                      6,379                      6,049                      5,967
R²                   0.007                      0.017                      0.009                      0.010
Adjusted R²          0.007                      0.017                      0.009                      0.010
Residual Std. Error  73.490 (df = 5783)         90.501 (df = 6376)         83.694 (df = 6046)         72.910 (df = 5964)
F Statistic          21.263*** (df = 2; 5783)   56.341*** (df = 2; 6376)   28.707*** (df = 2; 6046)   30.250*** (df = 2; 5964)


Following the book, we augment the regression model (13.3) by different sets of 
regressors for two reasons: 


1. If the additional regressors explain some of the observed variation in the 
dependent variable, we obtain more efficient estimates of the coefficients 
of interest. 

2. If the treatment is not received at random due to failures to follow the 
treatment protocol (see Chapter 13.3 of the book), the estimates obtained 
using (13.3) may be biased. Adding additional regressors may solve or 
mitigate this problem. 


In particular, we consider the following student and teacher characteristics 


• experience: Teacher’s years of experience
• boy: Student is a boy (dummy)
• lunch: Free lunch eligibility (dummy)
• black: Student is African-American (dummy)
• race: Student’s race is other than black or white (dummy)
• schoolid: School indicator variables


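in the four population regression specifications

\begin{align}
Y_i =& \, \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + u_i, \tag{13.4} \\
Y_i =& \, \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + u_i, \tag{13.5} \\
Y_i =& \, \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + schoolid + u_i, \tag{13.6}
\end{align}

and

\begin{align}
Y_i =& \, \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + \beta_4 boy + \beta_5 lunch \tag{13.7} \\
& + \beta_6 black + \beta_7 race + schoolid + u_i. \tag{13.8}
\end{align}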


Prior to estimation, we perform some subsetting and data wrangling using func- 
tions from the packages dplyr and tidyr. These are both part of tidyverse, 
a collection of R packages designed for data science and handling big datasets 
(see the official site for more on tidyverse packages). The functions %>%, 
transmute() and mutate() are sufficient for us here: 


• %>% allows us to chain function calls.
• transmute() allows us to subset the data set by naming the variables to be kept.
• mutate() is convenient for adding new variables based on existing ones while preserving the latter.


The regression models (13.4) to (13.8) require the variables gender, ethnicity, 
stark, readk, mathk, lunchk, experiencek and schoolidk. After dropping 
the remaining variables using transmute(), we use mutate() to add three ad- 
ditional binary variables which are derivatives of existing ones: black, race and 
boy. They are generated using logical statements within the function ifelse(). 
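For readers who want to see the logic of these two steps without the tidyverse, the following base R sketch performs an analogous subsetting and variable construction. The data frame toy is a made-up stand-in for a few STAR columns. It is not the code used in this chapter.

```r
# toy stand-in for a few STAR columns (values are made up)
toy <- data.frame(gender = factor(c("female", "male", "male")),
                  ethnicity = factor(c("afam", "cauc", "asian")),
                  readk = c(447, 450, 431),
                  mathk = c(473, 536, 463),
                  birth = c("1980 Q1", "1979 Q4", "1980 Q2"))

# keep selected columns, then add binary indicators derived from them
toy_k <- toy[, c("gender", "ethnicity", "readk", "mathk")]
toy_k <- transform(toy_k,
                   black = ifelse(ethnicity == "afam", 1, 0),
                   boy = ifelse(gender == "male", 1, 0))

toy_k
```

Column selection plays the role of transmute() and transform() that of mutate(). The dplyr versions read more naturally once several steps are chained with %>%.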


# load packages 'dplyr' and 'tidyr' for data wrangling functionalities
library(dplyr)
library(tidyr)

# generate subset with kindergarten data
STARK <- STAR %>%
  transmute(gender,
            ethnicity,
            stark,
            readk,
            mathk,
            lunchk,
            experiencek,
            schoolidk) %>%
  mutate(black = ifelse(ethnicity == "afam", 1, 0),
         race = ifelse(ethnicity == "afam" | ethnicity == "cauc", 1, 0),
         boy = ifelse(gender == "male", 1, 0))


# estimate the models 
gradeK1 <- lm(I(mathk + readk) ~ stark + experiencek, 
data = STARK) 


gradeK2 <- lm(I(mathk + readk) ~ stark + experiencek + schoolidk, 
data = STARK) 


gradeK3 <- lm(I(mathk + readk) ~ stark + experiencek + boy + lunchk 
+ black + race + schoolidk, 
data = STARK) 


For brevity, we exclude the coefficients for the indicator dummies in 
coeftest()’s output by subsetting the matrices. 

# obtain robust inference on the significance of coefficients
coeftest(gradeK1, vcov. = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>                    Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       904.72124    2.22235 407.1020 < 2.2e-16 ***
#> starksmall         14.00613    2.44704   5.7237 1.095e-08 ***
#> starkregular+aide  -0.60058    2.25430  -0.2664    0.7899    
#> experiencek         1.46903    0.16929   8.6778 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(gradeK2, vcov. = vcovHC, type = "HC1")[1:4, ]
#>                      Estimate Std. Error     t value     Pr(>|t|)
#> (Intercept)       925.6748750  7.6527218 120.9602155 0.000000e+00
#> starksmall         15.9330822  2.2411750   7.1092540 1.310324e-12
#> starkregular+aide   1.2151960  2.0353415   0.5970477 5.504993e-01
#> experiencek         0.7431059  0.1697619   4.3773429 1.222880e-05

coeftest(gradeK3, vcov. = vcovHC, type = "HC1")[1:7, ]
#>                      Estimate Std. Error     t value     Pr(>|t|)
#> (Intercept)       937.6831330 14.3726687  65.2407117 0.000000e+00
#> starksmall         15.8900507  2.1551817   7.3729516 1.908960e-13
#> starkregular+aide   1.7869378  1.9614592   0.9110247 3.623211e-01
#> experiencek         0.6627251  0.1659298   3.9940097 6.578846e-05
#> boy               -12.0905123  1.6726331  -7.2284306 5.533119e-13
#> lunchkfree        -34.7033021  1.9870366 -17.4648529 1.437931e-66
#> black             -25.4900130  3.4986918  -7.2685776 4.125252e-13


We now use stargazer() to gather all relevant information in a structured 
table. 


# compute robust standard errors for each model and gather them in a list
rob_se_2 <- list(sqrt(diag(vcovHC(fmk, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK1, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK2, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK3, type = "HC1"))))

stargazer(fmk, gradeK1, gradeK2, gradeK3,
          title = "Project STAR - Differences Estimates with Additional Regressors for Kindergarten",
          header = FALSE,
          type = "latex",
          model.numbers = FALSE,
          omit.table.layout = "n",
          digits = 3,
          dep.var.caption = "Dependent Variable: Test Score in Kindergarten",
          dep.var.labels.include = FALSE,
          se = rob_se_2)


The results in column (1) of Table 13.3 just repeat the results obtained for
(13.3). Columns (2) to (4) reveal that adding student characteristics and school
fixed effects does not lead to substantially different estimates of the treatment
effects. This result makes it more plausible that the estimates of the effects
obtained using model (13.3) do not suffer from failure of random assignment.
There is some decrease in the standard errors and some increase in $R^2$, implying
that the estimates are more precise.


Because teachers were randomly assigned to classes, inclusion of school fixed
effects allows us to estimate the causal effect of a teacher’s experience on test
scores of students in kindergarten. Regression (3) predicts the average effect of
10 years of experience on test scores to be $10 \cdot 0.74 = 7.4$ points. Be aware that the
other estimates on student characteristics in regression (4) do not have a causal
interpretation due to nonrandom assignment (see Chapter 13.3 of the book for
a detailed discussion).


Are the estimated effects presented in Table 13.3 large or small in a practical 
sense? Let us translate the predicted changes in test scores to units of stan- 
dard deviation in order to allow for a comparison (see Section 9.4 for a similar 
argument). 


# compute the sample standard deviations of test scores
SSD <- c("K" = sd(na.omit(STAR$readk + STAR$mathk)),
         "1" = sd(na.omit(STAR$read1 + STAR$math1)),
         "2" = sd(na.omit(STAR$read2 + STAR$math2)),
         "3" = sd(na.omit(STAR$read3 + STAR$math3)))

# translate the effects of small classes to standard deviations
Small <- c("K" = as.numeric(coef(fmk)[2]/SSD[1]),
           "1" = as.numeric(coef(fm1)[2]/SSD[2]),
           "2" = as.numeric(coef(fm2)[2]/SSD[3]),
           "3" = as.numeric(coef(fm3)[2]/SSD[4]))

# adjust the standard errors
SmallSE <- c("K" = as.numeric(rob_se_1[[1]][2]/SSD[1]),
             "1" = as.numeric(rob_se_1[[2]][2]/SSD[2]),
             "2" = as.numeric(rob_se_1[[3]][2]/SSD[3]),
             "3" = as.numeric(rob_se_1[[4]][2]/SSD[4]))

# translate the effects of regular classes with aide to standard deviations
RegAide <- c("K" = as.numeric(coef(fmk)[3]/SSD[1]),
             "1" = as.numeric(coef(fm1)[3]/SSD[2]),
             "2" = as.numeric(coef(fm2)[3]/SSD[3]),
             "3" = as.numeric(coef(fm3)[3]/SSD[4]))


# adjust the standard errors
RegAideSE <- c("K" = as.numeric(rob_se_1[[1]][3]/SSD[1]),
               "1" = as.numeric(rob_se_1[[2]][3]/SSD[2]),
               "2" = as.numeric(rob_se_1[[3]][3]/SSD[3]),
               "3" = as.numeric(rob_se_1[[4]][3]/SSD[4]))

# gather the results in a data.frame and round
df <- t(round(data.frame(
  Small, SmallSE, RegAide, RegAideSE, SSD),
  digits = 2))


It is fairly easy to turn df into a table.


# generate a simple table using stargazer
stargazer(df,
          title = "Estimated Class Size Effects (in Units of Standard Deviations)",
          type = "html",
          summary = FALSE,
          header = FALSE)


Table 13.4: Estimated Class Size Effects (in Units of Standard Deviations)

             K        1        2        3
Small        0.190    0.330    0.230    0.210
SmallSE      0.030    0.030    0.030    0.030
RegAide      0        0.130    0.040    0
RegAideSE    0.030    0.030    0.030    0.030
SSD          73.750   91.280   84.080   73.270





The estimated effect of a small class is largest for grade 1. As pointed out in
the book, this is probably because students in the control group for grade 1 did
poorly on the test for some unknown reason or simply due to random variation.
The difference between the estimated effect of being in a small class and being
in a regular class with an aide is roughly 0.2 standard deviations for all grades.
This leads to the conclusion that the effect of being in a regular sized class with
an aide is zero and the effect of being in a small class is roughly the same for
all grades.

The remainder of Chapter 13.3 in the book discusses to what extent these exper-
imental estimates are comparable with observational estimates obtained using 
data on school districts in California and Massachusetts in Chapter 9. It turns 
out that the estimates are indeed very similar. Please refer to the aforemen- 
tioned section in the book for a more detailed discussion. 


13.4 Quasi Experiments 


In quasi-experiments, “as if” randomness is exploited to use methods similar to
those that have been discussed in the previous chapter. There are two types of
quasi-experiments:¹


1. Random variations in individual circumstances allow us to view the treatment
“as if” it was randomly determined.


2. The treatment is only partially determined by “as if” random variation.


The former allows us to estimate the effect using either model (13.2), i.e., the
differences estimator with additional regressors, or, if there is doubt that the “as
if” randomness entirely ensures that there are no systematic differences
between control and treatment group, using the differences-in-differences (DID)
estimator. In the latter case, an IV approach for estimation of a model like
(13.2) which uses the source of “as if” randomness in treatment assignment as
the instrument may be applied.
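The IV idea for the second type of quasi-experiment can be sketched with simulated data in which an “as if” random variable Z only partially determines the treatment X. Everything in this example (the variable ability, the probabilities, the true effect of 2) is hypothetical and chosen for illustration.

```r
set.seed(2)
n <- 5000

Z <- rbinom(n, 1, 0.5)    # "as if" random variation
ability <- rnorm(n)       # unobserved determinant of the outcome

# treatment actually received: depends on Z but also on ability
X <- as.numeric(0.2 + 0.6 * Z + 0.3 * (ability > 0) > runif(n))
Y <- 1 + 2 * X + ability + rnorm(n)

# naive differences estimator (biased here because X depends on ability)
ols <- unname(coef(lm(Y ~ X))["X"])

# IV (Wald) estimator: effect of Z on Y divided by the effect of Z on X
wald <- (mean(Y[Z == 1]) - mean(Y[Z == 0])) /
  (mean(X[Z == 1]) - mean(X[Z == 0]))

c(ols = ols, iv = wald)
```

With a binary instrument and no further regressors, this ratio coincides with the TSLS estimate that ivreg(Y ~ X | Z) from the package AER would report.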


Some more advanced techniques that are helpful in settings where the treat- 
ment assignment is (partially) determined by a threshold in a so-called running 
variable are sharp regression discontinuity design (RDD) and fuzzy regression 
discontinuity design (FRDD). 


We briefly review these techniques and, since the book does not provide any 
empirical examples in this section, we will use our own simulated data in a 
minimal example to discuss how DID, RDD and FRDD can be applied in R. 


The Differences-in-Differences Estimator 


In quasi-experiments the source of “as if” randomness in treatment assignment
can often not entirely prevent systematic differences between control and treatment
groups. This problem was encountered by Card and Krueger (1994) who
use geography as the “as if” random treatment assignment to study the effect
on employment in fast-food restaurants caused by an increase in the state minimum
wage in New Jersey in the year of 1992. Their idea was to use the fact that
the increase in minimum wage applied to employees in New Jersey (treatment
group) but not to those living in neighboring Pennsylvania (control group).

¹ See Chapter 13.4 of the book for some example studies that are based on quasi-experiments.


It is quite conceivable that such a wage hike is not correlated with other deter- 
minants of employment. However, there still might be some state-specific dif- 
ferences and thus differences between control and treatment group. This would 
render the differences estimator biased and inconsistent. Card and Krueger 
(1994) solved this by using a DID estimator: they collected data in February 
1992 (before the treatment) and November 1992 (after the treatment) for the 
same restaurants and estimated the effect of the wage hike by analyzing 
differences in the differences in employment for New Jersey and Pennsylvania 
before and after the increase.[2] The DID estimator is 





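$$\hat{\beta}_1^{\text{diffs-in-diffs}} = \left(\overline{Y}^{\text{treatment,after}} - \overline{Y}^{\text{treatment,before}}\right) - \left(\overline{Y}^{\text{control,after}} - \overline{Y}^{\text{control,before}}\right) \quad (13.9)$$ 
$$= \Delta \overline{Y}^{\text{treatment}} - \Delta \overline{Y}^{\text{control}} \quad (13.10)$$ 
with 

• $\overline{Y}^{\text{treatment,before}}$: the sample average in the treatment group before the treatment 
• $\overline{Y}^{\text{treatment,after}}$: the sample average in the treatment group after the treatment 
• $\overline{Y}^{\text{control,before}}$: the sample average in the control group before the treatment 
• $\overline{Y}^{\text{control,after}}$: the sample average in the control group after the treatment. 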
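As a small worked example of (13.9) and (13.10), the following sketch computes the DID estimate from four group means. The numbers are the stylized values that the replication of Figure 13.1 uses for its points (control: 6 and 8, treatment: 7 and 11); the variable names are our own and purely illustrative.

```r
# stylized sample means (illustrative values, not real data)
treat_before   <- 7
treat_after    <- 11
control_before <- 6
control_after  <- 8

# change over time within each group
delta_treat   <- treat_after - treat_before
delta_control <- control_after - control_before

# DID estimate as in (13.10): difference of the two changes
did <- delta_treat - delta_control
did
#> [1] 2
```

The change in the control group (2) serves as the counterfactual change for the treatment group, so only the remaining 2 units are attributed to the treatment.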


We now use R to reproduce Figure 13.1 of the book. 


# initialize plot and add control group
plot(c(0, 1), c(6, 8), 
     type = "p",
     ylim = c(5, 12),
     xlim = c(-0.3, 1.3),
     main = "The Differences-in-Differences Estimator",
     xlab = "Period",
     ylab = "Y",
     col = "steelblue",
     pch = 20,
     xaxt = "n",
     yaxt = "n")

axis(1, at = c(0, 1), labels = c("before", "after"))
axis(2, at = c(0, 13))

# add treatment group
points(c(0, 1, 1), c(7, 9, 11), 
       col = "darkred",
       pch = 20)

# add line segments
lines(c(0, 1), c(7, 11), col = "darkred")
lines(c(0, 1), c(6, 8), col = "steelblue")
lines(c(0, 1), c(7, 9), col = "darkred", lty = 2)
lines(c(1, 1), c(9, 11), col = "black", lty = 2, lwd = 2)

# add annotations
text(1, 10, expression(hat(beta)[1]^{DID}), cex = 0.8, pos = 4)
text(0, 5.5, "s. mean control", cex = 0.8, pos = 4)
text(0, 6.8, "s. mean treatment", cex = 0.8, pos = 4)
text(1, 7.9, "s. mean control", cex = 0.8, pos = 4)
text(1, 11.1, "s. mean treatment", cex = 0.8, pos = 4)


[Figure: The Differences-in-Differences Estimator. Sample means of control and treatment group plotted against Period (before, after).] 


[2] Also see the box What is the Effect on Employment of the Minimum Wage? in Chapter 
13.4 of the book. 


The DID estimator (13.10) can also be written in regression notation: 
$\hat{\beta}_1^{DID}$ is the OLS estimator of $\beta_1$ in 


$$\Delta Y_i = \beta_0 + \beta_1 X_i + u_i, \quad (13.11)$$ 


where $\Delta Y_i$ denotes the difference in pre- and post-treatment outcomes of 
individual $i$ and $X_i$ is the treatment indicator. 


Adding regressors that measure pre-treatment characteristics to (13.11) we 
obtain 


$$\Delta Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + u_i, \quad (13.12)$$ 


the difference-in-differences estimator with additional regressors. The 
additional regressors may lead to a more precise estimate of $\beta_1$. 
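To illustrate why additional regressors can sharpen the estimate, here is a minimal sketch with simulated data. Everything in it is an assumption made for illustration: the pre-treatment characteristic W, the coefficient values and the seed are hypothetical and unrelated to the data used elsewhere in this section.

```r
set.seed(1)

n  <- 500
X  <- rep(c(0, 1), each = n / 2)        # treatment indicator
W  <- rnorm(n)                          # hypothetical pre-treatment characteristic
dY <- 2 + 4 * X + 1.5 * W + rnorm(n)    # change in outcome, true effect is 4

fit_simple <- lm(dY ~ X)        # specification (13.11)
fit_added  <- lm(dY ~ X + W)    # specification (13.12) with one regressor W

# compare the standard errors of the estimated treatment effect
se_simple <- summary(fit_simple)$coefficients["X", "Std. Error"]
se_added  <- summary(fit_added)$coefficients["X", "Std. Error"]
c(simple = se_simple, with_W = se_added)
```

Because W explains part of the variation in the outcome change but is unrelated to treatment, including it reduces the residual variance and hence the standard error of the coefficient on X.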


We keep things simple and focus on estimation of the treatment effect using 
DID in the simplest case, that is a control and a treatment group observed for 
two time periods — one before and one after the treatment. In particular, we 
will see that there are three different ways to proceed. 


First, we simulate pre- and post-treatment data using R. 


# set sample size 
n <- 200 


# define treatment effect 
TEffect <- 4 


# generate treatment dummy 
TDummy <- c(rep(0, n/2), rep(1, n/2)) 


# simulate pre- and post-treatment values of the dependent variable
y_pre <- 7 + rnorm(n)
# lower the level of the control group (the first n/2 observations) by 1
y_pre[1:(n/2)] <- y_pre[1:(n/2)] - 1
y_post <- 7 + 2 + TEffect * TDummy + rnorm(n)
y_post[1:(n/2)] <- y_post[1:(n/2)] - 1


Next, we plot the data. The function jitter() is used to add some artificial 
dispersion in the horizontal component of the points so that there is less 
overplotting. The function alpha() from the package scales allows us to adjust 
the opacity of colors used in plots. 


library(scales)

pre <- rep(0, length(y_pre[TDummy == 0]))
post <- rep(1, length(y_pre[TDummy == 0]))

# plot control group in t=1
plot(jitter(pre, 0.6), 
     y_pre[TDummy == 0], 
     ylim = c(0, 16), 
     col = alpha("steelblue", 0.3),
     pch = 20, 
     xlim = c(-0.5, 1.5),
     ylab = "Y",
     xlab = "Period",
     xaxt = "n",
     main = "Artificial Data for DID Estimation")

axis(1, at = c(0, 1), labels = c("before", "after"))

# add treatment group in t=1
points(jitter(pre, 0.6), 
       y_pre[TDummy == 1], 
       col = alpha("darkred", 0.3), 
       pch = 20)

# add control group in t=2
points(jitter(post, 0.6),
       y_post[TDummy == 0], 
       col = alpha("steelblue", 0.5),
       pch = 20)

# add treatment group in t=2
points(jitter(post, 0.6), 
       y_post[TDummy == 1], 
       col = alpha("darkred", 0.5),
       pch = 20)


[Figure: Artificial Data for DID Estimation. Y plotted against Period (before, after) for control and treatment group.] 


Observations from both the control and treatment group have a higher mean 
after the treatment, but the increase is stronger for the treatment group. 
Using DID we may estimate how much of that difference is due to the treatment. 


It is straightforward to compute the DID estimate in the fashion of (13.10). 


# compute the DID estimator for the treatment effect 'by hand'
mean(y_post[TDummy == 1]) - mean(y_pre[TDummy == 1]) - 
(mean(y_post[TDummy == 0]) - mean(y_pre[TDummy == 0])) 

#> [1] 3.960268 


Notice that the estimate is close to 4, the value chosen as the treatment effect 
TEffect above. Since (13.11) is a simple linear model, we may perform OLS 
estimation of this regression specification using lm(). 


# compute the DID estimator using a linear model
lm(I(y_post - y_pre) ~ TDummy)


#> 
#> Call:
#> lm(formula = I(y_post - y_pre) ~ TDummy)
#> 
#> Coefficients:
#> (Intercept)       TDummy  
#>       2.104        3.960


We find that the estimates coincide. Furthermore, one can show that the DID 
estimate obtained by estimating specification (13.11) by OLS is the same as the 
OLS estimate of $\beta_{TE}$ in 


$$Y_i = \beta_0 + \beta_1 D_i + \beta_2 Period_i + \beta_{TE} (Period_i \times D_i) + \varepsilon_i \quad (13.13)$$ 


where $D_i$ is the binary treatment indicator, $Period_i$ is a binary indicator 
for the after-treatment period and $Period_i \times D_i$ is the interaction of 
both. 


As for (13.11), estimation of (13.13) using R is straightforward. See Chapter 8 
for a discussion of interaction terms. 


# prepare data for DID regression using the interaction term 
d <- data.frame("Y" = c(y_pre,y_post), 

"Treatment" = TDummy, 

"Period" = c(rep("1", n), rep("2", n))) 


# estimate the model 

lm(Y ~ Treatment * Period, data = d) 

#> 

#> Call: 

#> lm(formula = Y ~ Treatment * Period, data = d)
#> 
#> Coefficients:
#> (Intercept) Treatment Period2 Treatment:Period2 
#> 5.858 LLIT 2.104 3.960 


As expected, the estimate of the coefficient on the interaction of the treatment 
dummy and the time dummy coincides with the estimates obtained using (13.10) 
and OLS estimation of (13.11). 
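The equivalence of the three approaches is algebraic, not a coincidence of one sample. The following self-contained sketch checks this numerically on freshly simulated data; the seed, the object names and the data-generating values are our own assumptions and differ from the chunk above.

```r
set.seed(42)

n      <- 200
TDummy <- rep(c(0, 1), each = n / 2)
y_pre  <- 6 + TDummy + rnorm(n)
y_post <- 8 + TDummy + 4 * TDummy + rnorm(n)   # true treatment effect is 4

# (1) differences of group means, as in (13.10)
did_means <- (mean(y_post[TDummy == 1]) - mean(y_pre[TDummy == 1])) -
             (mean(y_post[TDummy == 0]) - mean(y_pre[TDummy == 0]))

# (2) regression of the individual changes on the treatment dummy, as in (13.11)
did_diff <- unname(coef(lm(I(y_post - y_pre) ~ TDummy))["TDummy"])

# (3) interaction specification on stacked data, as in (13.13)
long <- data.frame(Y = c(y_pre, y_post),
                   Treatment = rep(TDummy, 2),
                   Period = rep(c(0, 1), each = n))
did_inter <- unname(coef(lm(Y ~ Treatment * Period, data = long))["Treatment:Period"])

# all three agree up to numerical precision
all.equal(did_means, did_diff)
all.equal(did_means, did_inter)
```

Note that only the point estimates coincide. Standard errors from (13.13) treat the stacked observations as independent, which need not be appropriate when the same units are observed twice.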


Regression Discontinuity Estimators 


Consider the model 


$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i \quad (13.14)$$ 


and let 


$$X_i = \begin{cases} 1, & W_i \geq c \\ 0, & W_i < c \end{cases}$$ 


so that the receipt of treatment, $X_i$, is determined by some threshold $c$ of a 
continuous variable $W_i$, the so-called running variable. The idea of regression 
discontinuity design is to use observations with a $W_i$ close to $c$ for 
estimation of $\beta_1$. $\beta_1$ is the average treatment effect for individuals 
with $W_i = c$, which is assumed to be a good approximation to the treatment effect 
in the population. (13.14) is called a sharp regression discontinuity design 
because treatment assignment is deterministic and discontinuous at the cutoff: 
all observations with $W_i < c$ do not receive treatment and all observations 
where $W_i \geq c$ are treated. 
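Before turning to a dedicated package, it may help to see that model (13.14) with a common slope on both sides of the cutoff is an ordinary linear regression of the outcome on the cutoff dummy and the running variable. The following base R sketch assumes a cutoff of c = 0 and uses our own seed and object names; it is an illustration of the idea, not a description of how any package implements it.

```r
set.seed(123)

# simulated data: the treatment effect at the cutoff is 10
W <- runif(1000, -1, 1)
X <- as.numeric(W >= 0)                 # treatment is determined by the cutoff c = 0
Y <- 3 + 2 * W + 10 * X + rnorm(1000)

# sharp RDD with a common slope: OLS of Y on the dummy and the running variable
srdd_lm <- lm(Y ~ X + W)
coef(srdd_lm)["X"]

# the same regression using only observations close to the cutoff
srdd_local <- lm(Y ~ X + W, subset = abs(W) < 0.5)
coef(srdd_local)["X"]
```

Restricting the sample to a window around the cutoff trades precision for robustness against misspecification of the functional form away from the cutoff. The window width of 0.5 is an arbitrary choice made for this example.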


The subsequent code chunks show how to estimate a linear SRDD using R and 
how to produce plots in the way of Figure 13.2 of the book. 


# generate some sample data
W <- runif(1000, -1, 1)
y <- 3 + 2 * W + 10 * (W >= 0) + rnorm(1000)

# load the package 'rddtools'
library(rddtools)

# construct rdd_data
data <- rdd_data(y, W, cutpoint = 0)

# plot the sample data
plot(data,
     col = "steelblue",
     cex = 0.35, 
     xlab = "W", 
     ylab = "Y")


[Figure: binned sample data, Y against W; h=0.0205/0.0205, n bins=98 (49/49)] 


The argument nbins sets the number of bins the running variable is divided 
into for aggregation. The dots represent bin averages of the outcome variable. 


We may use the function rdd_reg_lm() to estimate the treatment effect using 
model (13.14) for the artificial data generated above. By choosing slope = 
"same" we restrict the slopes of the estimated regression function to be the 
same on both sides of the jump at the cutpoint W = 0. 


# estimate the sharp RDD model
rdd_mod <- rdd_reg_lm(rdd_object = data, 
                      slope = "same")
summary(rdd_mod)
#> 
#> Call:
#> lm(formula = y ~ ., data = dat_step1, weights = weights)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 

#> 32361 0 6779 00039 TOT7113 310096) 

#> 

#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2.93889    0.07082   41.50   <2e-16 ***
#> D           10.12692    0.12631   80.18   <2e-16 ***
#> x            1.88249    0.11074   17.00   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.019 on 997 degrees of freedom
#> Multiple R-squared:  0.972,  Adjusted R-squared:  0.972 
#> F-statistic: 1.732e+04 on 2 and 997 DF,  p-value: < 2.2e-16


The coefficient estimate of interest is labeled D. The estimate is very close to 
the treatment effect chosen in the DGP above. 


It is easy to visualize the result: simply call plot() on the estimated model 
object. 


# plot the RDD model along with binned observations
plot(rdd_mod,
     cex = 0.35, 
     col = "steelblue", 
     xlab = "W", 
     ylab = "Y")


[Figure: estimated sharp RDD function with binned observations, Y against W; h=0.9961/0.9961, n bins=3 (1/2)] 


As above, the dots represent averages of binned observations. 


So far we assumed that crossing of the threshold determines receipt of treatment 
so that the jump of the population regression functions at the threshold can be 
regarded as the causal effect of the treatment. 


When crossing the threshold $c$ is not the only cause for receipt of the treatment, 
treatment is not a deterministic function of $W_i$. Instead, it is useful to think 
of $c$ as a threshold where the probability of receiving the treatment jumps. 


This jump may be due to unobservable variables that have an impact on the 
probability of being treated. Thus, $X_i$ in (13.14) will be correlated with the 
error $u_i$ and it becomes more difficult to consistently estimate the treatment 
effect. In this setting, using a fuzzy regression discontinuity design, which is 
based on an IV approach, may be a remedy: take the binary variable $Z_i$ as an 
indicator for crossing of the threshold, 


$$Z_i = \begin{cases} 1, & W_i \geq c \\ 0, & W_i < c, \end{cases}$$ 


and assume that $Z_i$ relates to $Y_i$ only through the treatment indicator $X_i$. 
Then $Z_i$ and $u_i$ are uncorrelated but $Z_i$ influences receipt of treatment so 
it is correlated with $X_i$. Thus, $Z_i$ is a valid instrument for $X_i$ and 
(13.14) can be estimated using TSLS. 
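The two stages can be written out by hand in base R. The sketch below is our own illustration under stated assumptions: the data-generating process, the unobservable v that drives both compliance and the outcome, and all object names are hypothetical. It is meant to show the mechanics of TSLS with the cutoff dummy as instrument, not to reproduce any package's implementation.

```r
set.seed(1234)

n <- 2000
W <- rnorm(n)                            # running variable, cutoff c = 0
Z <- as.numeric(W >= 0)                  # instrument: crossing the cutoff
v <- rnorm(n)                            # unobservable affecting compliance and outcome
X <- Z * as.numeric(v > qnorm(0.2))      # about 80% of those with W >= 0 are treated
Y <- 0.7 * W + 2 * X + 0.5 * v + rnorm(n, sd = 0.5)   # true treatment effect is 2

# naive OLS ignores that X is correlated with the error (through v)
naive_ols <- coef(lm(Y ~ X + W))["X"]

# first stage: predict treatment using the instrument and the running variable
first_stage <- lm(X ~ Z + W)
X_hat <- fitted(first_stage)

# second stage: regress the outcome on the fitted values and the running variable
second_stage <- lm(Y ~ X_hat + W)
tsls <- coef(second_stage)["X_hat"]

c(naive_ols = unname(naive_ols), tsls = unname(tsls))
```

The TSLS point estimate is close to the true effect while naive OLS is biased upward in this design. The standard errors reported by lm() for the second stage are not valid, because they ignore that X_hat is itself estimated; dedicated TSLS routines correct for this.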


The following code chunk generates sample data where observations with a value 
of the running variable $W_i$ below the cutoff $c = 0$ do not receive treatment and 
observations with $W_i \geq 0$ do receive treatment with a probability of 80% so 
that treatment status is only partially determined by the running variable and 
the cutoff. Treatment leads to an increase in $Y$ by 2 units. Observations with 
$W_i \geq 0$ that do not receive treatment are called no-shows: think of an 
individual that was assigned to receive the treatment but somehow manages to 
avoid it. 


library (MASS) 


# generate sample data 
mu <- c(0, 0) 
sigma <- matrix(c(1, 0.7, 0.7, 1), ncol = 2) 


set.seed (1234) 
d <- as.data.frame(mvrnorm(2000, mu, sigma) ) 
colnames(d) <- c("W", "Y") 


# introduce fuzziness 
d$treatProb <- ifelse(d$W < 0, 0, 0.8) 


fuzz <- sapply(X = d$treatProb, FUN = function(x) rbinom(1, 1, prob = x)) 


# treatment effect 
d$Y <- d$Y + fuzz * 2 


sapply() applies the function provided to FUN to every element of the argument 
X. Here, d$treatProb is a vector and the result is a vector, too. 


We plot all observations and use blue color to mark individuals that did not 
receive the treatment and use red color for those who received the treatment. 


# generate a colored plot of treatment and control group
plot(d$W, d$Y,
     col = c("steelblue", "darkred")[factor(fuzz)], 
     pch = 20, 
     cex = 0.5,
     xlim = c(-3, 3),
     ylim = c(-3.5, 5),
     xlab = "W",
     ylab = "Y")

# add a dashed vertical line at cutoff
abline(v = 0, lty = 2)


Obviously, receipt of treatment is no longer a deterministic function of the 
running variable W. Some observations with W ≥ 0 did not receive the treatment. 
We may estimate an FRDD by additionally setting treatProb as the assignment 
variable z in rdd_data(). Then rdd_reg_lm() applies the following TSLS 
procedure: treatment is predicted using $W_i$ and the cutoff dummy $Z_i$, the 
instrumental variable, in the first stage regression. The fitted values from the 
first stage regression are used to obtain a consistent estimate of the treatment 
effect using the second stage where the outcome Y is regressed on the fitted 
values and the running variable W. 


# estimate the Fuzzy RDD
data <- rdd_data(d$Y, d$W, 
                 cutpoint = 0, 
                 z = d$treatProb)

frdd_mod <- rdd_reg_lm(rdd_object = data, 
                       slope = "same")
frdd_mod
#> ### RDD regression: parametric ###
#>  Polynomial order:  1 
#>  Slopes:  same 
#>  Number of obs: 2000 (left: 999, right: 1001)
#> 
#>  Coefficient:
#>   Estimate Std. Error t value  Pr(>|t|)    
#> D 1.981297   0.084696  23.393 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimate is close to 2, the population treatment effect. We may call plot () 
on the model object to obtain a figure consisting of binned data and the esti- 
mated regression function. 


# plot estimated FRDD function
plot(frdd_mod, 
     cex = 0.5, 
     lwd = 0.4,
     xlim = c(-4, 4),
     ylim = c(-3.5, 5),
     xlab = "W",
     ylab = "Y")


[Figure: estimated FRDD function with binned observations, Y against W; h=3.3374/3.3374, n bins=4 (2/2)] 


What if we used an SRDD instead, thereby ignoring the fact that treatment is 
not perfectly determined by the cutoff in W? We may get an impression of the 
consequences by estimating an SRDD using the previously simulated data. 


# estimate SRDD
data <- rdd_data(d$Y, 
                 d$W, 
                 cutpoint = 0)

srdd_mod <- rdd_reg_lm(rdd_object = data, 
                       slope = "same")
srdd_mod
#> ### RDD regression: parametric ###
#>  Polynomial order:  1 
#>  Slopes:  same 
#>  Number of obs: 2000 (left: 999, right: 1001)
#> 
#>  Coefficient:
#>   Estimate Std. Error t value  Pr(>|t|)    
#> D 1.585088   0.067156  23.603 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimate obtained using a SRDD is suggestive of a substantial downward 
bias. In fact, this procedure is inconsistent for the true causal effect so increasing 
the sample would not alleviate the bias. 
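The size of the bias can be read off the simulation design. As a sketch, assuming the linear specification with a common slope is correct (which holds for the simulated data): only 80% of the observations to the right of the cutoff are treated, so the jump in the conditional mean of Y at the cutoff is 0.8 times the treatment effect, and dividing this jump by the jump in the treatment probability recovers the effect.

```latex
\begin{aligned}
\text{jump in } E(Y_i \mid W_i) \text{ at } c
  &= \underbrace{0.8}_{\text{jump in } P(X_i = 1 \mid W_i)} \times \underbrace{2}_{\beta_1} = 1.6, \\
\beta_1
  &= \frac{\text{jump in } E(Y_i \mid W_i) \text{ at } c}{\text{jump in } P(X_i = 1 \mid W_i) \text{ at } c}
   = \frac{1.6}{0.8} = 2 .
\end{aligned}
```

This matches the estimates above: the SRDD estimate of about 1.59 is close to 1.6, whereas the FRDD estimate of about 1.98 is close to 2.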


The book continues with a discussion of potential problems with quasi- 
experiments. As for all empirical studies, these potential problems are related 
to internal and external validity. This part is followed by a technical discussion 
of treatment effect estimation when the causal effect of treatment is heteroge- 
neous in the population. We encourage you to work on these sections on your 
own. 


Summary 


This chapter has introduced the concept of causal effects in randomized con- 
trolled experiments and quasi-experiments where variations in circumstances 
or accidents of nature are treated as sources of “as if” random assignment to 
treatment. We have also discussed methods that allow for consistent estima- 
tion of these effects in both settings. These included the differences estimator, 
the differences-in-differences estimator as well as sharp and fuzzy regression 
discontinuity design estimators. It was shown how to apply these estimation 
techniques in R. 


In an empirical application we have shown how to replicate the results of the 
analysis of the STAR data presented in Chapter 13.3 of the book using R. 
This study uses a randomized controlled experiment to assess whether smaller 
classes improve students’ performance on standardized tests. Being related to 
a randomized controlled experiment, the data of this study are fundamentally 
different from those used in the cross-section studies in Chapters 4 to 8. We 
therefore have motivated usage of a differences estimator. 


Chapter 13.4 demonstrated how estimates of treatment effects can be obtained 
when the design of the study is a quasi-experiment that allows for 
differences-in-differences or regression discontinuity design estimators. In 
particular, we have introduced functions of the package rddtools that are 
convenient for estimation as well as graphical analysis when estimating a 
regression discontinuity design. 


13.5 Exercises 


The subsequent exercises guide you in reproducing some of the results presented 
in one of the most famous DID studies by Card and Krueger (1994). The authors 
use geography as the “as if” random treatment assignment to study the effect on 
employment in fast food restaurants caused by an increase in the state minimum 
wage in New Jersey in the year of 1992, see Chapter 13.4. 


The study is based on survey data collected in February 1992 and in November 
1992, after New Jersey’s minimum wage rose by $0.80 from $4.25 to $5.05 in 
April 1992. 


Estimating the effect of the wage increase simply by computing the change in 
employment in New Jersey (as you are asked to do in Exercise 3) would fail to 
control for omitted variables. By using Pennsylvania as a control in a difference- 
in-differences (DID) model one can control for variables with a common influence 
on New Jersey (treatment group) and Pennsylvania (control group). This re- 
duces the risk of omitted variable bias enormously and even works when these 
variables are unobserved. 


For the DID approach to work we must assume that New Jersey and Pennsylvania 
have parallel trends over time, i.e., we assume that the (unobserved) factors 
influence employment in Pennsylvania and New Jersey in the same manner. This 
allows us to interpret an observed change in employment in Pennsylvania as the 
change New Jersey would have experienced if there had been no increase in the 
minimum wage (and vice versa). 


Contrary to what standard economic theory would suggest, the authors did not 
find evidence that the increased minimum wage induced an increase in 
unemployment in New Jersey using the DID approach: quite the contrary, their 
results suggest that the $0.80 minimum wage increase in New Jersey led to a 
2.75 full-time equivalent (FTE) increase in employment. 


Chapter 14 


Introduction to Time Series 
Regression and Forecasting 


Time series data is data collected for a single entity over time. This is 
fundamentally different from cross-section data which is data on multiple 
entities at the same point in time. Time series data allows estimation of the 
effect on Y of a change in X over time. This is what econometricians call a 
dynamic causal effect. Let us go back to the application to cigarette 
consumption of Chapter 12 where we were interested in estimating the effect on 
cigarette demand of a price increase caused by an increase in the general sales 
tax. One might use time series data to assess the causal effect of a tax 
increase on smoking both initially and in subsequent periods. 


Another application of time series data is forecasting. For example, weather ser- 
vices use time series data to predict tomorrow’s temperature by inter alia using 
today’s temperature and temperatures of the past. To motivate an economic ex- 
ample, central banks are interested in forecasting next month’s unemployment 
rates. 


The remaining chapters of the book deal with the econometric techniques for 
the analysis of time series data and applications to forecasting and estimation 
of dynamic causal effects. This section covers the basic concepts presented in 
Chapter 14 of the book, explains how to visualize time series data and 
demonstrates how to estimate simple autoregressive models, where the regressors 
are past values of the dependent variable or other variables. In this context 
we also discuss the concept of stationarity, an important property which has 
far-reaching consequences. 


Most empirical applications in this chapter are concerned with forecasting and 
use data on U.S. macroeconomic indicators or financial time series like Gross 
Domestic Product (GDP), the unemployment rate or excess stock returns. 


The following packages and their dependencies are needed for reproduction of 
the code chunks presented throughout this chapter: 


• AER (Kleiber and Zeileis, 2020) 
• dynlm (Zeileis, 2019) 
• forecast (Hyndman et al., 2020) 
• readxl (Wickham and Bryan, 2019) 
• stargazer (Hlavac, 2018) 
• scales (Wickham and Seidel, 2020) 
• quantmod (Ryan and Ulrich, 2020) 
• urca (Pfaff, 2016) 


Please verify that the following code chunk runs on your machine without any 
errors. 


library(AER)
library(dynlm)
library(forecast)
library(readxl)
library(stargazer)
library(scales)
library(quantmod)
library(urca)


14.1 Using Regression Models for Forecasting 


What is the difference between estimating models for assessment of causal effects 
and forecasting? Consider again the simple example of estimating the causal 
effect of the student-teacher ratio on test scores introduced in Chapter 4. 


library (AER) 

data(CASchools) 

CASchools$STR <- CASchools$students/CASchools$teachers 
CASchools$score <- (CASchools$read + CASchools$math) /2 


mod <- lm(score ~ STR, data = CASchools) 

mod 

#> 

#> Call: 

#> lm(formula = score ~ STR, data = CASchools) 
#> 

#> Coefficients: 

#> (Intercept)          STR  
#>      698.93        -2.28  


As has been stressed in Chapter 6, the estimate of the coefficient on the 
student-teacher ratio does not have a causal interpretation due to omitted 
variable bias. However, in terms of deciding which school to send her child to, 
it might nevertheless be appealing for a parent to use mod for forecasting test 
scores in school districts where no public data on scores are available. 


As an example, assume that the average class in a district has 25 students. The 
resulting prediction is not a perfect forecast, but the following one-liner 
might help the parent decide. 


predict(mod, newdata = data.frame("STR" = 25)) 
#> 1 
#> 641.9377 
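What predict() does here is plug the new regressor value into the estimated regression line. The following sketch makes this explicit and also attaches a prediction interval to convey that the forecast is uncertain. It uses simulated data with hypothetical values so that it runs without the AER package; mod_demo and the numbers in the data-generating step are our own assumptions.

```r
set.seed(10)

# hypothetical data mimicking a negative relation between STR and test scores
STR   <- runif(100, 14, 26)
score <- 700 - 2.3 * STR + rnorm(100, sd = 10)
mod_demo <- lm(score ~ STR)

# forecast for STR = 25 using predict() ...
by_predict <- predict(mod_demo, newdata = data.frame(STR = 25))

# ... and by evaluating the estimated regression line by hand
by_hand <- coef(mod_demo)["(Intercept)"] + coef(mod_demo)["STR"] * 25

all.equal(unname(by_predict), unname(by_hand))

# a 95% prediction interval quantifies the forecast uncertainty
pred_int <- predict(mod_demo, newdata = data.frame(STR = 25),
                    interval = "prediction", level = 0.95)
pred_int
```

The interval is wide relative to the differences between districts one would typically compare, which is one way to see why the point forecast should not be over-interpreted.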


In a time series context, the parent could use data on present and past years’ 
test scores to forecast next year’s test scores, a typical application for an 
autoregressive model. 


14.2 Time Series Data and Serial Correlation 


GDP is commonly defined as the value of goods and services produced over a 
given time period. The data set us_macro_quarterly.xlsx is provided by the 
authors and can be downloaded here. It provides quarterly data on U.S. real 
(i.e. inflation adjusted) GDP from 1947 to 2004. 


As before, a good starting point is to plot the data. The package quantmod 
provides some convenient functions for plotting and computing with time series 
data. We also load the package readxl to read the data into R. 


# attach the packages 'quantmod' and 'readxl'
library(quantmod)
library(readxl)


We begin by importing the data set. 


# load US macroeconomic data
USMacroSWQ <- read_xlsx("Data/us_macro_quarterly.xlsx",
                        sheet = 1,
                        col_types = c("text", rep("numeric", 9)))

# format date column
USMacroSWQ$...1 <- as.yearqtr(USMacroSWQ$...1, format = "%Y:0%q")

# adjust column names
colnames(USMacroSWQ) <- c("Date", "GDPC96", "JAPAN_IP", "PCECTPI", 
                          "GS10", "GS1", "TB3MS", "UNRATE", "EXUSUK", 
                          "CPIAUCSL")


The first column of us_macro_quarterly.xlsx contains text and the remaining 
ones are numeric. Using col_types = c("text", rep("numeric", 9)) we tell 
read_xlsx() to take this into account when importing the data. 


It is useful to work with time-series objects that keep track of the frequency of 
the data and are extensible. In what follows we will use objects of the class xts, 
see ?xts. Since the data in USMacroSWQ are in quarterly frequency we convert 
the first column to yearqtr format before generating the xts object GDP. 


# GDP series as xts object
GDP <- xts(USMacroSWQ$GDPC96, USMacroSWQ$Date)["1960::2013"]

# GDP growth series as xts object
GDPGrowth <- xts(400 * log(GDP/lag(GDP)))


The following code chunks reproduce Figure 14.1 of the book. 


# reproduce Figure 14.1 (a) of the book
plot(log(as.zoo(GDP)),
     col = "steelblue",
     lwd = 2,
     ylab = "Logarithm",
     xlab = "Date",
     main = "U.S. Quarterly Real GDP")

# reproduce Figure 14.1 (b) of the book
plot(as.zoo(GDPGrowth),
     col = "steelblue",
     lwd = 2,
     ylab = "Percent (annual rate)",
     xlab = "Date",
     main = "U.S. Real GDP Growth Rates")


[Figure: U.S. Real GDP Growth Rates, plotted against Date] 


Notation, Lags, Differences, Logarithms and Growth Rates 


For observations of a variable $Y$ recorded over time, $Y_t$ denotes the value 
observed at time $t$. The period between two sequential observations $Y_t$ and 
$Y_{t-1}$ is a unit of time: hours, days, weeks, months, quarters, years etc. 
Key Concept 14.1 introduces the essential terminology and notation for time 
series data we use in the subsequent sections. 


Key Concept 14.1 
Lags, First Differences, Logarithms and Growth Rates 


• Previous values of a time series are called lags. The first lag of $Y_t$ is 
$Y_{t-1}$. The $j^{th}$ lag of $Y_t$ is $Y_{t-j}$. In R, lags of univariate or 
multivariate time series objects are conveniently computed by lag(), see ?lag. 


• Sometimes we work with differenced series. The first difference of a series 
is $\Delta Y_t = Y_t - Y_{t-1}$, the difference between periods $t$ and $t-1$. 
If Y is a time series, the series of first differences is computed as diff(Y). 


• It may be convenient to work with the first difference in logarithms of a 
series. We denote this by $\Delta \log(Y_t) = \log(Y_t) - \log(Y_{t-1})$. For a 
time series Y, this is obtained using log(Y/lag(Y)). 


• $100 \Delta \log(Y_t)$ is an approximation for the percentage change between 
$Y_t$ and $Y_{t-1}$. 
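The definitions can be traced on a short numeric vector. The sketch below uses hypothetical values and builds the lag by shifting the vector explicitly, so that it does not depend on how lag() treats a particular time series class (for ts objects from base R, lag() shifts the time index rather than the values, which differs from the behavior for the xts objects used in this chapter).

```r
# hypothetical levels of a series observed in five consecutive periods
Y <- c(100, 102, 105, 103, 108)

# first lag: the value of the previous period (not available in period 1)
lag1 <- c(NA, head(Y, -1))

# first difference, identical to diff(Y) apart from the leading NA
dY <- Y - lag1

# first difference in logarithms
dlogY <- log(Y) - log(lag1)

# exact percentage change versus the log approximation
pct_exact  <- 100 * (Y - lag1) / lag1
pct_approx <- 100 * dlogY

round(cbind(Y, lag1, dY, pct_exact, pct_approx), 3)
```

The approximation is good for small changes and deteriorates as changes become larger: in this example the largest gap occurs for the last observation, where the series rises by almost 5%.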





The definitions made in Key Concept 14.1 are useful because of two properties 
that are common to many economic time series: 


• Exponential growth: some economic series grow approximately exponentially 
such that their logarithm is approximately linear. 


• The standard deviation of many economic time series is approximately 
proportional to their level. Therefore, the standard deviation of the 
logarithm of such a series is approximately constant. 


Furthermore, it is common to report growth rates in macroeconomic series which 
is why log-differences are often used. 


Table 14.1 of the book presents the quarterly U.S. GDP time series, its 
logarithm, the annualized growth rate and the first lag of the annualized 
growth rate series for the period 2012:Q1 - 2013:Q1. The following simple 
function can be used to compute these quantities for a quarterly time series. 


# compute logarithms, annual growth rates and 1st lag of growth rates
quants <- function(series) {
  s <- series
  return(
    data.frame("Level" = s,
               "Logarithm" = log(s),
               "AnnualGrowthRate" = 400 * log(s / lag(s)),
               "1stLagAnnualGrowthRate" = lag(400 * log(s / lag(s))))
  )
}


The annual growth rate is computed using the approximation 


$$AnnualGrowthY_t = 400 \cdot \Delta \log(Y_t)$$ 


since $100 \cdot \Delta \log(Y_t)$ is an approximation of the quarterly 
percentage changes, see Key Concept 14.1. 


We call quants() on observations for the period 2011:Q3 - 2013:Q1. 


# obtain a data.frame with level, logarithm, annual growth rate and its 1st lag of GDP
quants(GDP["2011-07::2013-01"])
#>            Level Logarithm AnnualGrowthRate X1stLagAnnualGrowthRate
#> 2011 Q3 15062.14  9.619940               NA                      NA
#> 2011 Q4 15242.14  9.631819        4.7518062                      NA
#> 2012 Q1 15381.56  9.640925        3.6422231               4.7518062
#> 2012 Q2 15427.67  9.643918        1.1972004               3.6422231
#> 2012 Q3 15533.99  9.650785        2.7470216               1.1972004
#> 2012 Q4 15539.63  9.651149        0.1452808               2.7470216
#> 2013 Q1 15583.95  9.653997        1.1392015               0.1452808
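As a quick plausibility check, the growth rate for 2011 Q4 can be recomputed from the two levels printed in the table. The sketch below uses the rounded levels as shown, so the result agrees with the tabulated value only up to rounding; the comparison with the compounded quarterly rate is our own addition.

```r
# levels of real GDP as printed in the table (rounded to two decimals)
gdp_2011q3 <- 15062.14
gdp_2011q4 <- 15242.14

# annualized growth rate using the log approximation: 400 * Delta log(Y_t)
growth_log <- 400 * log(gdp_2011q4 / gdp_2011q3)
growth_log

# for comparison: exact quarterly growth compounded over four quarters
growth_compound <- 100 * ((gdp_2011q4 / gdp_2011q3)^4 - 1)
growth_compound
```

The log-based figure is roughly 4.75, in line with the table, while compounding the quarterly rate gives a slightly larger number. The gap reflects the approximation error discussed in Key Concept 14.1.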


Autocorrelation 


Observations of a time series are typically correlated. This type of correlation 
is called autocorrelation or serial correlation. Key Concept 14.2 summarizes 
the concepts of population autocovariance and population autocorrelation and 
shows how to compute their sample equivalents. 


Key Concept 14.2 
Autocorrelation and Autocovariance 


The covariance between $Y_t$ and its $j^{th}$ lag, $Y_{t-j}$, is called the 
$j^{th}$ autocovariance of the series $Y_t$. The $j^{th}$ autocorrelation 
coefficient, also called the serial correlation coefficient, measures the 
correlation between $Y_t$ and $Y_{t-j}$. We thus have 


$$j^{th} \text{ autocovariance} = Cov(Y_t, Y_{t-j}),$$ 

$$j^{th} \text{ autocorrelation} = \rho_j = \rho_{Y_t, Y_{t-j}} = \frac{Cov(Y_t, Y_{t-j})}{\sqrt{Var(Y_t) Var(Y_{t-j})}}.$$ 


Population autocovariance and population autocorrelation can be estimated by 
$\widehat{Cov(Y_t, Y_{t-j})}$, the sample autocovariance, and $\hat{\rho}_j$, 
the sample autocorrelation: 


$$\widehat{Cov(Y_t, Y_{t-j})} = \frac{1}{T} \sum_{t=j+1}^{T} \left(Y_t - \overline{Y}_{j+1:T}\right)\left(Y_{t-j} - \overline{Y}_{1:T-j}\right),$$ 

$$\hat{\rho}_j = \frac{\widehat{Cov(Y_t, Y_{t-j})}}{\widehat{Var(Y_t)}}.$$ 


$\overline{Y}_{j+1:T}$ denotes the average of $Y_{j+1}, Y_{j+2}, \dots, Y_T$. 


In R the function acf() from the package stats computes the sample 
autocovariance or the sample autocorrelation function. 
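To see what acf() computes, the sample autocorrelations can be reproduced by hand. Note one detail: acf() centers all observations at the full-sample mean and divides by the full sample size, which differs slightly from the subsample means in the formula of Key Concept 14.2; in large samples the two versions are practically identical. The sketch below uses a simulated AR(1) series; the seed, the AR coefficient and the helper rho_hat() are our own assumptions.

```r
set.seed(7)

# simulate a hypothetical autocorrelated series
y <- as.numeric(arima.sim(model = list(ar = 0.5), n = 200))
n <- length(y)
m <- mean(y)

# sample autocorrelation at lag j as computed by acf():
# full-sample mean, sums scaled by the same factor in numerator and denominator
rho_hat <- function(j) {
  sum((y[(j + 1):n] - m) * (y[1:(n - j)] - m)) / sum((y - m)^2)
}

by_hand <- sapply(1:4, rho_hat)

# acf() returns lag 0 first, so lags 1 to 4 are elements 2 to 5
by_acf <- acf(y, lag.max = 4, plot = FALSE)$acf[2:5]

all.equal(by_hand, by_acf)
```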





Using acf () it is straightforward to compute the first four sample autocorrela- 
tions of the series GDPGrowth. 


acf(na.omit(GDPGrowth), lag.max = 4, plot = F) 

#> 

#> Autocorrelations of series 'na.omit(GDPGrowth)', by lag 
#> 

#> 0.00 0.25 0.50 0.75 1.00 

#> 1.000 0.352 0.273 0.114 0.106 


This is evidence that there is mild positive autocorrelation in the growth of 
GDP: if GDP grows faster than average in one period, there is a tendency for 
it to grow faster than average in the following periods. 


Other Examples of Economic Time Series 


Figure 14.2 of the book presents four plots: the U.S. unemployment rate, the 
U.S. Dollar / British Pound exchange rate, the logarithm of the Japanese indus- 
trial production index as well as daily changes in the Wilshire 5000 stock price 
index, a financial time series. The next code chunk reproduces the plots of the 
three macroeconomic series and adds percentage changes in the daily values of 
the New York Stock Exchange Composite index as a fourth one (the data set 
NYSESW comes with the AER package). 


# define series as xts objects
USUnemp <- xts(USMacroSWQ$UNRATE, USMacroSWQ$Date)["1960::2013"]

DollarPoundFX <- xts(USMacroSWQ$EXUSUK, USMacroSWQ$Date)["1960::2013"]

JPIndProd <- xts(log(USMacroSWQ$JAPAN_IP), USMacroSWQ$Date)["1960::2013"]

# attach NYSESW data
data("NYSESW")
NYSESW <- xts(Delt(NYSESW))

# divide plotting area into 2x2 matrix
par(mfrow = c(2, 2))

# plot the series
plot(as.zoo(USUnemp),
     col = "steelblue",
     lwd = 2,
     ylab = "Percent",
     xlab = "Date",
     main = "US Unemployment Rate",
     cex.main = 1)

plot(as.zoo(DollarPoundFX),
     col = "steelblue",
     lwd = 2,
     ylab = "Dollar per pound",
     xlab = "Date",
     main = "U.S. Dollar / B. Pound Exchange Rate",
     cex.main = 1)

plot(as.zoo(JPIndProd),
     col = "steelblue",
     lwd = 2,
     ylab = "Logarithm",
     xlab = "Date",
     main = "Japanese Industrial Production",
     cex.main = 1)

plot(as.zoo(NYSESW),
     col = "steelblue",
     lwd = 2,
     ylab = "Percent per Day",
     xlab = "Date",
     main = "New York Stock Exchange Composite Index",
     cex.main = 1)
The series show quite different characteristics. The unemployment rate increases
during recessions and declines during economic recoveries and growth. The
Dollar/Pound exchange rate shows a deterministic pattern until the end of
the Bretton Woods system. Japan's industrial production exhibits an upward
trend and decreasing growth. Daily changes in the New York Stock Exchange
composite index seem to fluctuate randomly around the zero line. The sample
autocorrelations support this conjecture.


# compute sample autocorrelation for the NYSESW series
acf(na.omit(NYSESW), plot = F, lag.max = 10)

#>
#> Autocorrelations of series 'na.omit(NYSESW)', by lag
#>
#>      0      1      2      3      4      5      6      7      8      9     10
#>  1.000  0.040 -0.016 -0.023  0.000 -0.036 -0.027 -0.059  0.013  0.017  0.004


The first 10 sample autocorrelation coefficients are very close to zero. The
default plot generated by acf() provides further evidence.

# plot sample autocorrelation for the NYSESW series
acf(na.omit(NYSESW), main = "Sample Autocorrelation for NYSESW Data")

[Figure: Sample Autocorrelation for NYSESW Data]
The blue dashed bands represent values beyond which the autocorrelations are
significantly different from zero at the 5% level. Even when the true
autocorrelations are zero, we need to expect a few exceedances: recall the
definition of a type-I error from Key Concept 3.5. For most lags we see that the
sample autocorrelation does not exceed the bands and there are only a few cases
that lie marginally beyond the limits.
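To see why a few exceedances are unsurprising, the sketch below counts how many sample autocorrelations of pure noise fall outside the bands. The data are simulated i.i.d. normal draws (an assumption, not the NYSESW series), and the bands are computed as $\pm 1.96/\sqrt{T}$, the large-sample bounds that the default acf() plot draws.

```r
# Sketch: exceedances of the 5% bands when the true autocorrelations are zero.
# Assumption: simulated white noise instead of the NYSESW returns.
set.seed(42)
n <- 1000
noise <- rnorm(n)

rho  <- acf(noise, lag.max = 40, plot = FALSE)$acf[-1]  # drop lag 0
band <- qnorm(0.975) / sqrt(n)                          # approximate 5% bounds

# number of lags (out of 40) whose sample autocorrelation leaves the bands
sum(abs(rho) > band)
```

With 40 lags and a 5% level one would expect about two exceedances on average even though the series is unpredictable by construction.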


Furthermore, the NYSESW series exhibits what econometricians call volatility 
clustering: there are periods of high and periods of low variance. This is com- 
mon for many financial time series. 


14.3 Autoregressions 


Autoregressive models are heavily used in economic forecasting. An
autoregressive model relates a time series variable to its past values. This
section discusses the basic ideas of autoregressive models, shows how they are
estimated and discusses an application to forecasting GDP growth using R.


The First-Order Autoregressive Model 


It is intuitive that the immediate past of a variable should have power to predict 
its near future. The simplest autoregressive model uses only the most recent 
outcome of the time series observed to predict future values. For a time series
$Y_t$ such a model is called a first-order autoregressive model, often abbreviated
AR(1), where the 1 indicates that the order of autoregression is one:

$$Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$$

is the AR(1) population model of a time series $Y_t$.


For the GDP growth series, an autoregressive model of order one uses only
the information on GDP growth observed in the last quarter to predict a
future growth rate. The first-order autoregression model of GDP growth can
be estimated by computing OLS estimates in the regression of $GDPGR_t$ on
$GDPGR_{t-1}$,

$$\widehat{GDPGR}_t = \hat\beta_0 + \hat\beta_1 GDPGR_{t-1}. \tag{14.1}$$


Following the book we use data from 1962 to 2012 to estimate (14.1). This is 
easily done with the function ar.ols() from the package stats. 


# subset data
GDPGRSub <- GDPGrowth["1962::2012"]

# estimate the model
ar.ols(GDPGRSub,
       order.max = 1,
       demean = F,
       intercept = T)

#>
#> Call:
#> ar.ols(x = GDPGRSub, order.max = 1, demean = F, intercept = T)
#>
#> Coefficients:
#>      1
#> 0.3384
#>
#> Intercept: 1.995 (0.2993)
#>
#> Order selected 1  sigma^2 estimated as  9.886


We can check that the computations done by ar.ols() are the same as done
by lm().


# length of data set
N <- length(GDPGRSub)

GDPGR_level <- as.numeric(GDPGRSub[-1])
GDPGR_lags <- as.numeric(GDPGRSub[-N])

# estimate the model
armod <- lm(GDPGR_level ~ GDPGR_lags)
armod

#>
#> Call:
#> lm(formula = GDPGR_level ~ GDPGR_lags)
#>
#> Coefficients:
#> (Intercept)   GDPGR_lags
#>      1.9950       0.3384


As usual, we may use coeftest() to obtain a robust summary on the estimated 
regression coefficients. 


# robust summary
coeftest(armod, vcov. = vcovHC, type = "HC1")

#>
#> t test of coefficients:
#>
#>             Estimate Std. Error t value  Pr(>|t|)
#> (Intercept) 1.994986   0.351274  5.6793 4.691e-08 ***
#> GDPGR_lags  0.338436   0.076188  4.4421 1.470e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Thus the estimated model is 


$$\widehat{GDPGR}_t = \underset{(0.351)}{1.995} + \underset{(0.076)}{0.338} GDPGR_{t-1}. \tag{14.2}$$


We omit the first observation for $GDPGR_{1962:Q1}$ from the vector of the
dependent variable since $GDPGR_{1962:Q1-1} = GDPGR_{1961:Q4}$ is not included
in the sample. Similarly, the last observation, $GDPGR_{2012:Q4}$, is excluded
from the predictor vector since the data does not include $GDPGR_{2012:Q4+1} =
GDPGR_{2013:Q1}$. Put differently, when estimating the model, one observation
is lost because of the time series structure of the data.
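The loss of one observation and the equivalence of ar.ols() and lm() can be checked on any series. The sketch below uses simulated data (an assumption; the GDP series is not needed) and the base R helper embed() to line up $Y_t$ with $Y_{t-1}$. Setting aic = FALSE is a choice made here so that ar.ols() fits exactly one lag instead of selecting the order.

```r
# Sketch: AR(1) by OLS on simulated data, once with lm() and once with ar.ols().
# Assumption: a simulated series with mean 2 replaces GDP growth.
set.seed(123)
y <- as.numeric(arima.sim(list(ar = 0.3), n = 204)) + 2

# embed() puts Y_t in column 1 and Y_{t-1} in column 2; the first row is lost
lagged <- embed(y, 2)
fit_lm <- lm(lagged[, 1] ~ lagged[, 2])

fit_ar <- ar.ols(y, aic = FALSE, order.max = 1, demean = FALSE, intercept = TRUE)

# 204 observations, but only 203 usable (Y_t, Y_{t-1}) pairs
c(n_obs = length(y), n_used = nrow(lagged))

# intercept and slope from both functions
rbind(lm     = unname(coef(fit_lm)),
      ar.ols = c(as.numeric(fit_ar$x.intercept), as.numeric(fit_ar$ar)))
```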


Forecasts and Forecast Errors 


Suppose $Y_t$ follows an AR(1) model with an intercept and that you have an
OLS estimate of the model on the basis of observations for $T$ periods. Then you
may use the AR(1) model to obtain $\widehat{Y}_{T+1|T}$, a forecast for $Y_{T+1}$ using data up
to period $T$ where

$$\widehat{Y}_{T+1|T} = \hat\beta_0 + \hat\beta_1 Y_T.$$

The forecast error is

$$\text{Forecast error} = Y_{T+1} - \widehat{Y}_{T+1|T}.$$


Forecasts and Predicted Values 


Forecasted values of $Y_t$ are not what we refer to as OLS predicted values of $Y_t$.
Also, the forecast error is not an OLS residual. Forecasts and forecast errors
are obtained using out-of-sample values while predicted values and residuals
are computed for in-sample values that were actually observed and used in
estimating the model.


The root mean squared forecast error (RMSFE) measures the typical size of the
forecast error and is defined as

$$RMSFE = \sqrt{E\left[\left(Y_{T+1} - \widehat{Y}_{T+1|T}\right)^2\right]}.$$

The RMSFE is composed of the future errors $u_t$ and the error made when
estimating the coefficients. When the sample size is large, the former may
be much larger than the latter so that $RMSFE \approx \sqrt{Var(u_t)}$ which can be
estimated by the standard error of the regression.
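The distinction between in-sample residuals and an out-of-sample forecast error can be illustrated as follows. This is a sketch on simulated data (an assumption): the last observation is held back to play the role of $Y_{T+1}$, and the SER serves as a rough RMSFE estimate that ignores estimation uncertainty.

```r
# Sketch: one-step-ahead forecast, forecast error and SER for an AR(1).
# Assumption: simulated data; observation 201 is treated as "the future".
set.seed(2013)
y <- as.numeric(arima.sim(list(ar = 0.3), n = 201)) + 2
y_in  <- y[1:200]   # estimation sample, T = 200
y_out <- y[201]     # Y_{T+1}, not used in estimation

d <- data.frame(y = y_in[-1], y_lag = y_in[-200])
fit <- lm(y ~ y_lag, data = d)

# the forecast plugs Y_T, the last in-sample value, into the estimated model
y_hat <- unname(predict(fit, newdata = data.frame(y_lag = y_in[200])))
forecast_error <- y_out - y_hat

# SER as an estimate of the RMSFE (estimation uncertainty is ignored)
ser <- summary(fit)$sigma
c(forecast = y_hat, error = forecast_error, SER = ser)
```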


Application to GDP Growth 


Using (14.2), the estimated AR(1) model of GDP growth, we perform the
forecast for GDP growth for 2013:Q1 (remember that the model was estimated
using data for periods 1962:Q1 - 2012:Q4, so 2013:Q1 is an out-of-sample
period). Plugging $GDPGR_{2012:Q4} \approx 0.15$ into (14.2),

$$\widehat{GDPGR}_{2013:Q1} = 1.995 + 0.338 \cdot 0.15 \approx 2.046.$$


The function forecast() from the forecast package has some useful features
for forecasting time series data.

library(forecast)

# assign GDP growth rate in 2012:Q4
new <- data.frame("GDPGR_lags" = GDPGR_level[N-1])

# forecast GDP growth rate in 2013:Q1
forecast(armod, newdata = new)

#>   Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
#> 1       2.044155 -2.036225 6.124534 -4.213413 8.301723
Using forecast() produces the same point forecast of about 2.0, along with
80% and 95% forecast intervals, see section 14.5. We conclude that our AR(1)
model forecasts GDP growth to be 2% in 2013:Q1.


How accurate is this forecast? First, the forecast error is quite large:
$GDPGR_{2013:Q1} \approx 1.1\%$ while our forecast is 2%. Second, calling
summary(armod) shows that the model explains only little of the variation in
the growth rate of GDP and that the SER is about 3.16. Leaving aside forecast
uncertainty due to estimation of the model coefficients $\beta_0$ and $\beta_1$,
the RMSFE must be at least 3.16%, the estimate of the standard deviation of the
errors. We conclude that this forecast is pretty inaccurate.


# compute the forecast error
forecast(armod, newdata = new)$mean - GDPGrowth["2013"][1]
#>                 x
#> 2013 Q1 0.9049532

# R^2
summary(armod)$r.squared
#> [1] 0.1149576

# SER
summary(armod)$sigma
#> [1] 3.15979


Autoregressive Models of Order p 


For forecasting GDP growth, the AR(1) model (14.2) disregards any information 
in the past of the series that is more distant than one period. An AR(p) model 
incorporates the information of p lags of the series. The idea is explained in 
Key Concept 14.3. 


Key Concept 14.3 
Autoregressions 


An AR(p) model assumes that a time series $Y_t$ can be modeled by a
linear function of the first $p$ of its lagged values.

$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + u_t$$

is an autoregressive model of order $p$ where
$E(u_t | Y_{t-1}, Y_{t-2}, \dots, Y_{t-p}) = 0$.





Following the book, we estimate an AR(2) model of the GDP growth series from
1962:Q1 to 2012:Q4.

# estimate the AR(2) model
GDPGR_AR2 <- dynlm(ts(GDPGR_level) ~ L(ts(GDPGR_level)) + L(ts(GDPGR_level), 2))

coeftest(GDPGR_AR2, vcov. = sandwich)

#>
#> t test of coefficients:
#>
#>                       Estimate Std. Error t value  Pr(>|t|)
#> (Intercept)           1.631747   0.402023  4.0588 7.096e-05 ***
#> L(ts(GDPGR_level))    0.277787   0.079250  3.5052 0.0005643 ***
#> L(ts(GDPGR_level), 2) 0.179269   0.079951  2.2422 0.0260560 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimation yields 


$$\widehat{GDPGR}_t = \underset{(0.40)}{1.63} + \underset{(0.08)}{0.28} GDPGR_{t-1} + \underset{(0.08)}{0.18} GDPGR_{t-2}. \tag{14.3}$$


We see that the coefficient on the second lag is significantly different from zero.
The fit improves slightly: $R^2$ grows from 0.11 for the AR(1) model to about
0.14 and the SER reduces to 3.13.


# R^2
summary(GDPGR_AR2)$r.squared
#> [1] 0.1425484

# SER
summary(GDPGR_AR2)$sigma
#> [1] 3.132122


We may use the AR(2) model to obtain a forecast for GDP growth in 2013:Q1 
in the same manner as for the AR(1) model. 


# AR(2) forecast of GDP growth in 2013:Q1
# (GDPGR_level has N-1 elements, so index N-1 is 2012:Q4 and N-2 is 2012:Q3)
forecast <- c("2013:Q1" = coef(GDPGR_AR2) %*% c(1, GDPGR_level[N-1], GDPGR_level[N-2]))


This leads to a forecast error of roughly $-1\%$.

# compute AR(2) forecast error
GDPGrowth["2013"][1] - forecast
#>                 x
#> 2013 Q1 -1.025358
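The same steps generalize to any lag order. The sketch below fits an AR(p) by OLS with base R only and forms the one-step forecast as the inner product of the coefficient vector with $(1, Y_T, Y_{T-1})$. The series is simulated and the helper name fit_ar_p() is hypothetical, introduced here purely for illustration.

```r
# Sketch: AR(p) by OLS for a general p, plus a one-step-ahead forecast.
# Assumptions: simulated data; fit_ar_p() is a hypothetical helper.
set.seed(7)
y <- as.numeric(arima.sim(list(ar = c(0.3, 0.2)), n = 300)) + 1

fit_ar_p <- function(x, p) {
  m <- embed(x, p + 1)   # column 1: Y_t, columns 2..p+1: lags 1..p
  lm(m[, 1] ~ m[, -1, drop = FALSE])
}

fit2 <- fit_ar_p(y, 2)
n <- length(y)

# forecast for period n + 1: intercept + b1 * Y_n + b2 * Y_{n-1}
fc <- sum(coef(fit2) * c(1, y[n], y[n - 1]))
fc
```

Note that p observations are lost at the start of the sample, so an AR(2) fitted to 300 observations uses 298 of them.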
14.4 Can You Beat the Market? (Part I)


The theory of efficient capital markets states that stock prices embody all
currently available information. If this hypothesis holds, it should not be
possible to estimate a useful model for forecasting future stock returns using
publicly available information on past returns (this is also referred to as the
weak-form efficiency hypothesis). If it were possible to forecast the market,
for example by relying on an AR(2) model, traders could exploit information
that is not already priced in. Their trading would push prices until the
expected return is zero.


This idea is presented in the box Can You Beat the Market? (Part I) on p. 582 
of the book. This section reproduces the estimation results. 


We start by importing monthly data from 1931:1 to 2002:12 on excess returns 
of a broad-based index of stock prices, the CRSP value-weighted index. The 
data are provided by the authors of the book as an excel sheet which can be 
downloaded here. 


# read in data on stock returns
SReturns <- read_xlsx("Data/Stock_Returns_1931_2002.xlsx",
                      sheet = 1,
                      col_types = "numeric")


We continue by converting the data to an object of class ts. 


# convert to ts object 

StockReturns <- ts(SReturns[, 3:4], 
start = c(1931, 1), 
end = c(2002, 12), 
frequency = 12) 


Next, we estimate AR(1), AR(2) and AR(4) models of excess returns for the 
time period 1960:1 to 2002:12. 


# estimate AR models:

# AR(1)
SR_AR1 <- dynlm(ExReturn ~ L(ExReturn),
                data = StockReturns, start = c(1960, 1), end = c(2002, 12))

# AR(2)
SR_AR2 <- dynlm(ExReturn ~ L(ExReturn) + L(ExReturn, 2),
                data = StockReturns, start = c(1960, 1), end = c(2002, 12))

# AR(4): the lags are written out one by one so that no regressor is duplicated
# and the coefficient names match those of the AR(1) and AR(2) models
SR_AR4 <- dynlm(ExReturn ~ L(ExReturn) + L(ExReturn, 2) + L(ExReturn, 3) + L(ExReturn, 4),
                data = StockReturns, start = c(1960, 1), end = c(2002, 12))


After computing robust standard errors, we gather the results in a table gener- 
ated by stargazer(). 


# compute robust standard errors
rob_se <- list(sqrt(diag(sandwich(SR_AR1))),
               sqrt(diag(sandwich(SR_AR2))),
               sqrt(diag(sandwich(SR_AR4))))

# generate table using 'stargazer()'
stargazer(SR_AR1, SR_AR2, SR_AR4,
          title = "Autoregressive Models of Monthly Excess Stock Returns",
          header = FALSE,
          model.numbers = F,
          omit.table.layout = "n",
          digits = 3,
          column.labels = c("AR(1)", "AR(2)", "AR(4)"),
          dep.var.caption = "Dependent Variable: Excess Returns on the CRSP Value-Weighted Index",
          dep.var.labels.include = FALSE,
          covariate.labels = c("$excess return_{t-1}$", "$excess return_{t-2}$",
                               "$excess return_{t-3}$", "$excess return_{t-4}$",
                               "Intercept"),
          se = rob_se,
          omit.stat = "rsq")


The results are consistent with the hypothesis of efficient financial markets:
none of the coefficients on lagged excess returns is statistically significant
in any of the estimated models and the hypotheses that all of these coefficients
are zero cannot be rejected. The adjusted $R^2$ is almost zero in all models and
even negative for the AR(4) model. This suggests that none of the models are
useful for forecasting stock returns.
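What such a result looks like when returns are unpredictable by construction can be sketched with simulated data. The example below is an illustration under stated assumptions: the "returns" are i.i.d. normal draws, and the joint test is the homoskedasticity-only F-test from anova() instead of the robust version used for the table.

```r
# Sketch: joint F-test that all four AR coefficients are zero.
# Assumptions: simulated i.i.d. "excess returns"; non-robust F-statistic.
set.seed(1931)
ret <- rnorm(520, mean = 0.3, sd = 4.3)

m <- embed(ret, 5)                    # column 1: Y_t, columns 2..5: lags 1..4
unrestricted <- lm(m[, 1] ~ m[, 2:5]) # AR(4)
restricted   <- lm(m[, 1] ~ 1)        # intercept only

ftest <- anova(restricted, unrestricted)
ftest

# adjusted R^2 can be negative when the lags carry no information
summary(unrestricted)$adj.r.squared
```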


14.5 Additional Predictors and The ADL Model 


Instead of only using the dependent variable’s lags as predictors, an autoregres- 
sive distributed lag (ADL) model also uses lags of other variables for forecasting. 
The general ADL model is summarized in Key Concept 14.4: 
Table 14.1: Autoregressive Models of Monthly Excess Stock Returns

Dependent Variable: Excess returns on the CRSP Value-Weighted Index

                            AR(1)                 AR(2)                 AR(4)
---------------------------------------------------------------------------------------
excess return_{t-1}         0.050                 0.053                 0.054
                           (0.051)               (0.051)               (0.051)
excess return_{t-2}                              -0.053                -0.054
                                                 (0.048)               (0.048)
excess return_{t-3}                                                     0.009
                                                                       (0.050)
excess return_{t-4}                                                    -0.016
                                                                       (0.047)
Intercept                   0.312                 0.328*                0.331
                           (0.197)               (0.199)               (0.202)
---------------------------------------------------------------------------------------
Observations                516                   516                   516
Adjusted R^2                0.001                 0.001                -0.002
Residual Std. Error    4.334 (df = 514)      4.332 (df = 513)      4.340 (df = 511)
F Statistic          1.306 (df = 1; 514)   1.367 (df = 2; 513)   0.721 (df = 4; 511)
Key Concept 14.4
The Autoregressive Distributed Lag Model

An ADL(p,q) model assumes that a time series $Y_t$ can be represented
by a linear function of $p$ of its lagged values and $q$ lags of another time
series $X_t$:

$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p}
      + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \dots + \delta_q X_{t-q} + u_t$$

is an autoregressive distributed lag model with $p$ lags of $Y_t$ and $q$ lags of
$X_t$ where

$$E(u_t | Y_{t-1}, Y_{t-2}, \dots, X_{t-1}, X_{t-2}, \dots) = 0.$$





Forecasting GDP Growth Using the Term Spread 


Interest rates on long-term and short-term treasury bonds are closely linked to
macroeconomic conditions. While interest rates on both types of bonds have
the same long-run tendencies, they behave quite differently in the short run.
The difference in interest rates of two bonds with distinct maturity is called the
term spread.

The following code chunks reproduce Figure 14.3 of the book which displays
interest rates of 10-year U.S. Treasury bonds and 3-month U.S. Treasury bills
from 1960 to 2012.


# 3-month Treasury bills interest rate
TB3MS <- xts(USMacroSWQ$TB3MS, USMacroSWQ$Date)["1960::2012"]

# 10-year Treasury bonds interest rate
TB10YS <- xts(USMacroSWQ$GS10, USMacroSWQ$Date)["1960::2012"]

# term spread
TSpread <- TB10YS - TB3MS

# reproduce Figure 14.3 (a) of the book
plot(merge(as.zoo(TB3MS), as.zoo(TB10YS)),
     plot.type = "single",
     col = c("darkred", "steelblue"),
     lwd = 2,
     xlab = "Date",
     ylab = "Percent per annum",
     main = "Interest Rates")

# define function that transforms years to class 'yearqtr'
YToYQTR <- function(years) {
  return(
    sort(as.yearqtr(sapply(years, paste, c("Q1", "Q2", "Q3", "Q4"))))
  )
}

# recessions
recessions <- YToYQTR(c(1961:1962, 1970, 1974:1975, 1980:1982, 1990:1991, 2001, 2007:2008))

# add color shading for recessions
xblocks(time(as.zoo(TB3MS)),
        c(time(TB3MS) %in% recessions),
        col = alpha("steelblue", alpha = 0.3))

# add a legend
legend("topright",
       legend = c("TB3MS", "TB10YS"),
       col = c("darkred", "steelblue"),
       lwd = c(2, 2))
# reproduce Figure 14.3 (b) of the book
plot(as.zoo(TSpread),
     col = "steelblue",
     lwd = 2,
     xlab = "Date",
     ylab = "Percent per annum",
     main = "Term Spread")

# add color shading for recessions
xblocks(time(as.zoo(TB3MS)),
        c(time(TB3MS) %in% recessions),
        col = alpha("steelblue", alpha = 0.3))

[Figure: Term Spread]
Before recessions, the gap between interest rates on long-term bonds and
short-term bills narrows and consequently the term spread declines drastically
towards zero or even becomes negative in times of economic stress. This
information might be used to improve forecasts of future GDP growth.


We check this by estimating an ADL(2, 1) model and an ADL(2, 2) model of 
the GDP growth rate using lags of GDP growth and lags of the term spread as 
regressors. We then use both models for forecasting GDP growth in 2013:Q1. 


# convert growth and spread series to ts objects
GDPGrowth_ts <- ts(GDPGrowth,
                   start = c(1960, 1),
                   end = c(2013, 4),
                   frequency = 4)

TSpread_ts <- ts(TSpread,
                 start = c(1960, 1),
                 end = c(2012, 4),
                 frequency = 4)

# join both ts objects
ADLdata <- ts.union(GDPGrowth_ts, TSpread_ts)

# estimate the ADL(2,1) model of GDP growth
GDPGR_ADL21 <- dynlm(GDPGrowth_ts ~ L(GDPGrowth_ts) + L(GDPGrowth_ts, 2) + L(TSpread_ts),
                     start = c(1962, 1), end = c(2012, 4))

coeftest(GDPGR_ADL21, vcov. = sandwich)

#>
#> t test of coefficients:
#>
#>                    Estimate Std. Error t value Pr(>|t|)
#> (Intercept)        0.954990   0.486976  1.9611 0.051260 .
#> L(GDPGrowth_ts)    0.267729   0.082562  3.2428 0.001387 **
#> L(GDPGrowth_ts, 2) 0.192370   0.077683  2.4763 0.014104 *
#> L(TSpread_ts)      0.444047   0.182637  2.4313 0.015925 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The estimated equation of the ADL(2, 1) model is 


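$$\widehat{GDPGR}_t = \underset{(0.49)}{0.96} + \underset{(0.08)}{0.27} GDPGR_{t-1} + \underset{(0.08)}{0.19} GDPGR_{t-2} + \underset{(0.18)}{0.44} TSpread_{t-1}. \tag{14.4}$$

All slope coefficients are significant at the 5% level; the intercept is
significant at the 10% level.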


# 2012:Q3 / 2012:Q4 data on GDP growth and term spread 
subset <- window(ADLdata, c(2012, 3), c(2012, 4)) 


# ADL(2,1) GDP growth forecast for 2013:Q1
ADL21_forecast <- coef(GDPGR_ADL21) %*% c(1, subset[2, 1], subset[1, 1], subset[2, 2])
ADL21_forecast

#>          [,1]
#> [1,] 2.241689

# compute the forecast error
window(GDPGrowth_ts, c(2013, 1), c(2013, 1)) - ADL21_forecast
#>           Qtr1
#> 2013 -1.102487


Model (14.4) predicts the GDP growth in 2013:Q1 to be 2.24% which leads to
a forecast error of $-1.10\%$.
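The mechanics of the ADL forecast do not depend on dynlm(). The sketch below builds the lagged regressors by hand on simulated data (an assumption; x is a hypothetical predictor playing the role of the term spread) and confirms that the inner product of coefficients and regressor values matches what predict() returns.

```r
# Sketch: ADL(2,1) by OLS and a one-step-ahead forecast on simulated data.
# Assumptions: simulated series; x is a hypothetical stand-in for the term spread.
set.seed(99)
n <- 250
x <- as.numeric(arima.sim(list(ar = 0.8), n = n))
y <- numeric(n)
for (i in 3:n) {
  y[i] <- 1 + 0.3 * y[i - 1] + 0.2 * y[i - 2] + 0.5 * x[i - 1] + rnorm(1)
}

idx <- 3:n
d <- data.frame(y = y[idx], y_l1 = y[idx - 1], y_l2 = y[idx - 2], x_l1 = x[idx - 1])
adl21 <- lm(y ~ y_l1 + y_l2 + x_l1, data = d)

# forecast for period n + 1: the regressors are Y_n, Y_{n-1} and X_n
fc <- drop(coef(adl21) %*% c(1, y[n], y[n - 1], x[n]))

# the same forecast via predict()
fc_predict <- unname(predict(adl21,
                             newdata = data.frame(y_l1 = y[n], y_l2 = y[n - 1], x_l1 = x[n])))
c(by_hand = fc, predict = fc_predict)
```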


We estimate the ADL(2,2) specification to see whether adding additional infor- 
mation on past term spread improves the forecast. 


# estimate the ADL(2,2) model of GDP growth
GDPGR_ADL22 <- dynlm(GDPGrowth_ts ~ L(GDPGrowth_ts) + L(GDPGrowth_ts, 2)
                     + L(TSpread_ts) + L(TSpread_ts, 2),
                     start = c(1962, 1), end = c(2012, 4))

coeftest(GDPGR_ADL22, vcov. = sandwich)

#>
#> t test of coefficients:
#>
#>                     Estimate Std. Error t value Pr(>|t|)
#> (Intercept)         0.967967   0.472470  2.0487 0.041800 *
#> L(GDPGrowth_ts)     0.243175   0.077836  3.1242 0.002049 **
#> L(GDPGrowth_ts, 2)  0.177070   0.077027  2.2988 0.022555 *
#> L(TSpread_ts)      -0.139552   0.422162 -0.3306 0.741317
#> L(TSpread_ts, 2)    0.656347   0.429802  1.5271 0.128326
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We obtain


$$\widehat{GDPGR}_t = \underset{(0.47)}{0.97} + \underset{(0.08)}{0.24} GDPGR_{t-1} + \underset{(0.08)}{0.18} GDPGR_{t-2} - \underset{(0.42)}{0.14} TSpread_{t-1} + \underset{(0.43)}{0.66} TSpread_{t-2}. \tag{14.5}$$


The coefficients on both lags of the term spread are not significant at the 10% 
level. 


# ADL(2,2) GDP growth forecast for 2013:Q1
ADL22_forecast <- coef(GDPGR_ADL22) %*% c(1, subset[2, 1], subset[1, 1], subset[2, 2], subset[1, 2])
ADL22_forecast

#>          [,1]
#> [1,] 2.274407

# compute the forecast error
window(GDPGrowth_ts, c(2013, 1), c(2013, 1)) - ADL22_forecast
#>           Qtr1
#> 2013 -1.135206


The ADL(2,2) forecast of GDP growth in 2013:Q1 is 2.27% which implies a
forecast error of $-1.14\%$.


Do the ADL models (14.4) and (14.5) improve upon the simple AR(2) model
(14.3)? The answer is yes: while SER and $R^2$ improve only slightly, an F-test
on the term spread coefficients in (14.5) provides evidence that the model does
better in explaining GDP growth than the AR(2) model, as the hypothesis that
both coefficients are zero is rejected at the 5% level.

# compare R2
c("R2 AR(2)" = summary(GDPGR_AR2)$r.squared,
  "R2 ADL(2,1)" = summary(GDPGR_ADL21)$r.squared,
  "R2 ADL(2,2)" = summary(GDPGR_ADL22)$r.squared)

#>    R2 AR(2) R2 ADL(2,1) R2 ADL(2,2)
#>   0.1425484   0.1743996   0.1855245

# compare SER
c("SER AR(2)" = summary(GDPGR_AR2)$sigma,
  "SER ADL(2,1)" = summary(GDPGR_ADL21)$sigma,
  "SER ADL(2,2)" = summary(GDPGR_ADL22)$sigma)

#>    SER AR(2) SER ADL(2,1) SER ADL(2,2)
#>     3.132122     3.070760     3.057655

# F-test on coefficients of term spread
linearHypothesis(GDPGR_ADL22,
                 c("L(TSpread_ts)=0", "L(TSpread_ts, 2)=0"),
                 vcov. = sandwich)

#> Linear hypothesis test
#>
#> Hypothesis:
#> L(TSpread_ts) = 0
#> L(TSpread_ts, 2) = 0
#>
#> Model 1: restricted model
#> Model 2: GDPGrowth_ts ~ L(GDPGrowth_ts) + L(GDPGrowth_ts, 2) + L(TSpread_ts) +
#>     L(TSpread_ts, 2)
#>
#> Note: Coefficient covariance matrix supplied.
#>
#>   Res.Df Df      F  Pr(>F)
#> 1    201
#> 2    199  2 4.4344 0.01306 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stationarity


In general, forecasts can be improved by using multiple predictors — just as 
in cross-sectional regression. When constructing time series models one should 
take into account whether the variables are stationary or nonstationary. Key 
Concept 14.5 explains what stationarity is. 
Key Concept 14.5
Stationarity

A time series $Y_t$ is stationary if its probability distribution is time
independent, that is the joint distribution of $Y_{s+1}, Y_{s+2}, \dots, Y_{s+T}$ does
not change as $s$ is varied, regardless of $T$.

Similarly, two time series $X_t$ and $Y_t$ are jointly stationary if the
joint distribution of $(X_{s+1}, Y_{s+1}, X_{s+2}, Y_{s+2}, \dots, X_{s+T}, Y_{s+T})$ does not
depend on $s$, regardless of $T$.

In a probabilistic sense, stationarity means that information about how
a time series evolves in the future is inherent to its past. If this is not
the case, we cannot use the past of a series as a reliable guideline for its
future.

Stationarity makes it easier to learn about the characteristics of past
data.





Time Series Regression with Multiple Predictors 


The concept of stationarity is a key assumption in the general time series re- 
gression model with multiple predictors. Key Concept 14.6 lays out this model 
and its assumptions. 
Key Concept 14.6
Time Series Regression with Multiple Predictors

The general time series regression model extends the ADL model such
that multiple regressors and their lags are included. It uses $p$ lags of
the dependent variable and $q$ lags of each of the additional predictors
$X_l$ where $l = 1, \dots, k$:

$$\begin{aligned}
Y_t = & \ \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} \\
      & + \delta_{11} X_{1,t-1} + \delta_{12} X_{1,t-2} + \dots + \delta_{1q} X_{1,t-q} \\
      & + \dots \\
      & + \delta_{k1} X_{k,t-1} + \delta_{k2} X_{k,t-2} + \dots + \delta_{kq} X_{k,t-q} \\
      & + u_t
\end{aligned}$$

For estimation we make the following assumptions:

1. The error term $u_t$ has conditional mean zero given all regressors
and their lags:

$$E(u_t | Y_{t-1}, Y_{t-2}, \dots, X_{1,t-1}, X_{1,t-2}, \dots, X_{k,t-1}, X_{k,t-2}, \dots) = 0$$

This assumption is an extension of the conditional mean zero
assumption used for AR and ADL models and guarantees that the
general time series regression model stated above gives the best
forecast of $Y_t$ given its lags, the additional regressors $X_{1,t}, \dots, X_{k,t}$
and their lags.

2. The i.i.d. assumption for cross-sectional data is not (entirely)
meaningful for time series data. We replace it by the following
assumption which consists of two parts:

(a) The $(Y_t, X_{1,t}, \dots, X_{k,t})$ have a stationary distribution (the
"identically distributed" part of the i.i.d. assumption for
cross-sectional data). If this does not hold, forecasts may be biased
and inference can be strongly misleading.

(b) $(Y_t, X_{1,t}, \dots, X_{k,t})$ and $(Y_{t-j}, X_{1,t-j}, \dots, X_{k,t-j})$ become
independent as $j$ gets large (the "independently distributed" part
of the i.i.d. assumption for cross-sectional data). This
assumption is also called weak dependence. It ensures that the
WLLN and the CLT hold in large samples.

3. Large outliers are unlikely: the fourth moments $E(X_{1,t}^4), E(X_{2,t}^4), \dots, E(X_{k,t}^4)$
and $E(Y_t^4)$ are nonzero and finite.

4. No perfect multicollinearity.







Since many economic time series appear to be nonstationary, assumption two of
Key Concept 14.6 is a crucial one in applied macroeconomics and finance which
is why statistical tests for stationarity or nonstationarity have been developed.
Chapters 14.6 and 14.7 are devoted to this topic.


Statistical inference and the Granger causality test 


If $X$ is a useful predictor for $Y$, then in a regression of $Y_t$ on its own lags
and lags of $X_t$, not all of the coefficients on the lags of $X_t$ are zero. This
concept is called Granger causality and is an interesting hypothesis to test.
Key Concept 14.7 summarizes the idea.


Key Concept 14.7 
Granger Causality Tests 


The Granger causality test (Granger, 1969) is an F test of the null
hypothesis that all lags of a variable $X$ included in a time series regression
model do not have predictive power for $Y_t$. The Granger causality test
does not test whether $X$ actually causes $Y$ but whether the included lags
are informative in terms of predicting $Y$.





We have already performed a Granger causality test on the coefficients of term 
spread in (14.5), the ADL(2,2) model of GDP growth and concluded that at 
least one of the first two lags of term spread has predictive power for GDP 
growth. 
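As a minimal sketch of the test, the example below compares a restricted AR(2) with an unrestricted ADL(2,2) using anova(). Two assumptions apply: the data are simulated so that lagged x truly helps to predict y, and the F-statistic is the homoskedasticity-only version, whereas the text supplies a robust covariance matrix via linearHypothesis().

```r
# Sketch: Granger causality test as an F-test on the lags of X.
# Assumptions: simulated data in which X_{t-1} enters the DGP of Y_t;
# non-robust F-statistic.
set.seed(14)
n <- 300
x <- as.numeric(arima.sim(list(ar = 0.5), n = n))
y <- numeric(n)
for (i in 3:n) y[i] <- 0.3 * y[i - 1] + 0.4 * x[i - 1] + rnorm(1)

idx <- 3:n
d <- data.frame(y = y[idx],
                y_l1 = y[idx - 1], y_l2 = y[idx - 2],
                x_l1 = x[idx - 1], x_l2 = x[idx - 2])

restricted   <- lm(y ~ y_l1 + y_l2, data = d)                # AR(2)
unrestricted <- lm(y ~ y_l1 + y_l2 + x_l1 + x_l2, data = d)  # ADL(2,2)

# H0: both coefficients on lags of X are zero (X does not Granger-cause Y)
gc <- anova(restricted, unrestricted)
gc
```

A small p-value indicates that the lags of x have predictive content for y, which says nothing about whether x causes y in a structural sense.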


Forecast Uncertainty and Forecast Intervals 


In general, it is good practice to report a measure of the uncertainty when 
presenting results that are affected by the latter. Uncertainty is particularly of 
interest when forecasting a time series. For example, consider a simple ADL(1, 1) 
model 


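$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \delta_1 X_{t-1} + u_t$$

where $u_t$ is a homoskedastic error term. The forecast error is

$$Y_{T+1} - \widehat{Y}_{T+1|T} = u_{T+1} - \left[(\hat\beta_0 - \beta_0) + (\hat\beta_1 - \beta_1) Y_T + (\hat\delta_1 - \delta_1) X_T\right].$$

The mean squared forecast error (MSFE) and the RMSFE are

$$MSFE = E\left[(Y_{T+1} - \widehat{Y}_{T+1|T})^2\right]
       = \sigma_u^2 + Var\left[(\hat\beta_0 - \beta_0) + (\hat\beta_1 - \beta_1) Y_T + (\hat\delta_1 - \delta_1) X_T\right],$$

$$RMSFE = \sqrt{\sigma_u^2 + Var\left[(\hat\beta_0 - \beta_0) + (\hat\beta_1 - \beta_1) Y_T + (\hat\delta_1 - \delta_1) X_T\right]}.$$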





A 95% forecast interval is an interval that covers the true value of $Y_{T+1}$ in 95%
of repeated applications. There is a major difference in computing a confidence
interval and a forecast interval: when computing a confidence interval of a point
estimate we use large sample approximations that are justified by the CLT and
thus are valid for a large range of error term distributions. For computation of
a forecast interval of $Y_{T+1}$, however, we must make an additional assumption
about the distribution of $u_{T+1}$, the error term in period $T + 1$. Assuming that
$u_{T+1}$ is normally distributed one can construct a 95% forecast interval for $Y_{T+1}$
using $SE(Y_{T+1} - \widehat{Y}_{T+1|T})$, an estimate of the RMSFE:

$$\widehat{Y}_{T+1|T} \pm 1.96 \cdot SE(Y_{T+1} - \widehat{Y}_{T+1|T})$$


Of course, the computation gets more complicated when the error term is
heteroskedastic or if we are interested in computing a forecast interval for
$Y_{T+s}$, $s > 1$.
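For the simple homoskedastic one-step case, the interval can be sketched in two ways. The example uses simulated AR(1) data (an assumption) and compares the shortcut "point forecast plus or minus 1.96 times the SER" with the interval from predict(), which also accounts for coefficient estimation uncertainty and uses t quantiles.

```r
# Sketch: 95% forecast interval for a one-step-ahead AR(1) forecast.
# Assumption: simulated data with homoskedastic normal errors.
set.seed(5)
y <- as.numeric(arima.sim(list(ar = 0.3), n = 200)) + 2
n <- length(y)
d <- data.frame(y = y[-1], y_lag = y[-n])
fit <- lm(y ~ y_lag, data = d)
new <- data.frame(y_lag = y[n])

# (1) shortcut: point forecast +/- 1.96 * SER (ignores estimation error)
point <- unname(predict(fit, newdata = new))
approx_int <- point + c(-1, 1) * 1.96 * summary(fit)$sigma

# (2) predict() adds the estimation uncertainty and uses t quantiles
exact_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

approx_int
exact_int
```

The second interval is always slightly wider than the first; in large samples the difference becomes negligible, in line with the RMSFE being dominated by the variance of the error term.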


In some applications it is useful to report multiple forecast intervals for
subsequent periods, see the box The River of Blood on p. 592 of the book. These
can be visualized in a so-called fan chart. We will not replicate the fan chart
presented in Figure 14.2 of the book because the underlying model is far more
complex than the simple AR and ADL models treated here. Instead, in the
example below we use simulated time series data and estimate an AR(2) model
which is then used for forecasting the subsequent 25 future outcomes of the
series.


# set seed
set.seed(1234)

# simulate the time series
Y <- arima.sim(list(order = c(2, 0, 0), ar = c(0.2, 0.2)), n = 200)

# estimate an AR(2) model using 'arima()', see ?arima
model <- arima(Y, order = c(2, 0, 0))

# compute points forecasts and prediction intervals for the next 25 periods
fc <- forecast(model, h = 25, level = seq(5, 99, 10))

# plot a fan chart
plot(fc,
     main = "Forecast Fan Chart for AR(2) Model of Simulated Data",
     showgap = F,
     fcol = "red",
     flty = 2)

[Figure: Forecast Fan Chart for AR(2) Model of Simulated Data]
arima.sim() simulates autoregressive integrated moving average (ARIMA)
models. AR models belong to this class of models. We use list(order = c(2,
0, 0), ar = c(0.2, 0.2)) so the DGP is

$$Y_t = 0.2 Y_{t-1} + 0.2 Y_{t-2} + u_t.$$

We choose level = seq(5, 99, 10) in the call of forecast() such that
forecast intervals with levels 5%, 15%, ..., 95% are computed for each point
forecast of the series.

The dashed red line shows point forecasts of the series for the next 25 periods
based on the estimated AR(2) model and the shaded areas represent the prediction
intervals. The degree of shading indicates the level of the prediction interval.
The darkest of the blue bands displays the 5% forecast intervals and the color
fades towards grey as the level of the intervals increases.
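The numbers behind such a chart can also be obtained without the forecast package. The sketch below is a base R alternative, not the method used above: it takes point forecasts and standard errors from predict() and builds normal-based intervals for three levels chosen here purely for illustration.

```r
# Sketch: multi-level forecast intervals for an AR(2) with base R only.
# Assumptions: normal forecast errors; the three levels are illustrative.
set.seed(1234)
Y <- arima.sim(list(order = c(2, 0, 0), ar = c(0.2, 0.2)), n = 200)
model <- arima(Y, order = c(2, 0, 0))

p <- predict(model, n.ahead = 25)   # $pred: point forecasts, $se: standard errors
lvls <- c(0.50, 0.80, 0.95)

# half-widths of the intervals: rows are horizons, columns are levels
half <- sapply(lvls, function(l) qnorm(1 - (1 - l) / 2) * as.numeric(p$se))
lower <- as.numeric(p$pred) - half
upper <- as.numeric(p$pred) + half
colnames(lower) <- colnames(upper) <- paste0(100 * lvls, "%")

# point forecast and 95% interval for the first three horizons
head(cbind(forecast = as.numeric(p$pred),
           lower95 = lower[, "95%"],
           upper95 = upper[, "95%"]), 3)
```

The standard errors grow with the forecast horizon and level off, which is what produces the widening and then flattening shape of the fan.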


14.6 Lag Length Selection Using Information Criteria


The selection of lag lengths in AR and ADL models can sometimes be guided
by economic theory. However, there are statistical methods that are helpful
to determine how many lags should be included as regressors. In general, too
many lags inflate the standard errors of coefficient estimates and thus imply an
increase in the forecast error while omitting lags that should be included in the
model may result in an estimation bias.


The order of an AR model can be determined using two approaches: 


1. The F-test approach 


Estimate an AR(p) model and test the significance of the largest lag(s). If 
the test rejects, drop the respective lag(s) from the model. This approach 
has the tendency to produce models where the order is too large: in a 
significance test we always face the risk of rejecting a true null hypothesis! 


2. Relying on an information criterion 


To circumvent the issue of producing too large models, one may choose 
the lag order that minimizes one of the following two information criteria: 


• The Bayes information criterion (BIC):

$$BIC(p) = \log\left(\frac{SSR(p)}{T}\right) + (p+1)\frac{\log(T)}{T}$$

• The Akaike information criterion (AIC):

$$AIC(p) = \log\left(\frac{SSR(p)}{T}\right) + (p+1)\frac{2}{T}$$

Both criteria are estimators of the optimal lag length p. The lag order p
that minimizes the respective criterion is called the BIC estimate or the
AIC estimate of the optimal model order. The basic idea of both criteria
is a trade-off: the SSR decreases as additional lags are added to the model,
so the first term decreases, whereas the second term increases as the lag
order grows. One can show that the BIC is a consistent estimator of the
true lag order while the AIC is not, which is due to the differing factors
in the second addend. Nevertheless, both estimators are used in practice
where the AIC is sometimes used as an alternative when the BIC yields
a model with “too few” lags.


The function dynlm() does not compute information criteria by default. We will
therefore write a short function that reports the BIC (along with the chosen
lag order p and $R^2$) for objects of class dynlm.


# compute BIC for AR model objects of class 'dynlm'
BIC <- function(model) {

  # sum of squared residuals, sample size and number of coefficients
  ssr <- sum(model$residuals^2)
  t <- length(model$residuals)
  npar <- length(model$coef)

  return(
    round(c("p" = npar - 1,
            "BIC" = log(ssr/t) + npar * log(t)/t,
            "R2" = summary(model)$r.squared), 4)
  )
}


Table 14.3 of the book presents a breakdown of how the BIC is computed for 
AR(p) models of GDP growth with order p = 1,...,6. The final result can 
easily be reproduced using sapply() and the function BIC() defined above. 


# apply the BIC() to an intercept-only model of GDP growth 
BIC(dynlm(ts(GDPGR_level) ~ 1)) 

#> p BIC R2 

#> 0.0000 2.4394 0.0000 


# loop BIC over models of different orders 
order <- 1:6 


BICs <- sapply(order, function(x) 
"AR" = BIC(dynlm(ts(GDPGR_level) ~ L(ts(GDPGR_level), 1:x)))) 


BICs 
#>       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
#> p   1.0000 2.0000 3.0000 4.0000 5.0000 6.0000
#> BIC 2.3486 2.3475 2.3774 2.4034 2.4188 2.4429
#> R2  0.1143 0.1425 0.1434 0.1478 0.1604 0.1591


Note that increasing the lag order tends to increase $R^2$ because the SSR
decreases as additional lags are added to the model. According to the BIC,
however, we should settle for the AR(2) model instead of the AR(6) model: the BIC
helps us decide whether the decrease in SSR is enough to justify adding an
additional regressor.


If we had to compare a bigger set of models, a convenient way to select the 
model with the lowest BIC is using the function which.min(). 


# select the AR model with the smallest BIC 
BICs[, which.min(BICs[2, ])] 

#> p BIC R2 

#> 2.0000 2.3475 0.1425 
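
To see how the two criteria behave when the true lag order is known, the following sketch applies the BIC and AIC formulas from above to simulated AR(2) data. It uses base R only; the helper ic_ar(), the AR coefficients and the sample size are illustrative assumptions and are not part of the book's code.

```r
# Sketch: BIC and AIC for AR(p) models fitted by OLS to simulated AR(2) data.
# 'ic_ar' is a hypothetical helper, not part of the book's code.
set.seed(123)

# simulate 500 observations from Y_t = 0.5 Y_{t-1} + 0.3 Y_{t-2} + u_t
Y <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 500)

ic_ar <- function(y, p, pmax) {
  # embed() returns a matrix with columns y_t, y_{t-1}, ..., y_{t-pmax};
  # using pmax for every p keeps the estimation sample identical
  X <- embed(as.numeric(y), pmax + 1)
  fit <- lm(X[, 1] ~ X[, 2:(p + 1), drop = FALSE])
  ssr <- sum(residuals(fit)^2)
  n <- nrow(X)
  c(p = p,
    BIC = log(ssr / n) + (p + 1) * log(n) / n,
    AIC = log(ssr / n) + (p + 1) * 2 / n)
}

# one column per candidate lag order
ICs <- sapply(1:6, ic_ar, y = Y, pmax = 6)
round(ICs, 4)

# lag orders selected by each criterion
p_bic <- unname(ICs["p", which.min(ICs["BIC", ])])
p_aic <- unname(ICs["p", which.min(ICs["AIC", ])])
c(BIC = p_bic, AIC = p_aic)
```

Because the BIC penalty per coefficient, log(T)/T, exceeds the AIC penalty 2/T whenever log(T) > 2, the AIC never selects a smaller model than the BIC on the same sample.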


The BIC may also be used to select lag lengths in time series regression models
with multiple predictors. In a model with $K$ coefficients, including the intercept,
we have


$$BIC(K) = \log\left(\frac{SSR(K)}{T}\right) + K \frac{\log(T)}{T}.$$


Notice that choosing the optimal model according to the BIC can be compu- 
tationally demanding because there may be many different combinations of lag 
lengths when there are multiple predictors. 
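
The size of the search can be illustrated with a short count. The sketch below assumes one additional predictor (k = 1) and a maximum lag length of 12 per variable; both values are chosen for illustration only.

```r
# Sketch: number of candidate models when lag lengths vary freely.
# 'pmax' and 'k' are illustrative values.
pmax <- 12   # maximum lag length per variable
k <- 1       # number of additional predictors

# all combinations (p, q_1, ..., q_k) with each lag length in 1, ..., pmax
grid <- expand.grid(rep(list(1:pmax), k + 1))
nrow(grid)   # pmax^(k + 1) = 144 candidate models

# restricting all lag lengths to be equal leaves only pmax candidates
pmax
```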


To give an example, we estimate ADL(p,q) models of GDP growth where, as
above, the additional variable is the term spread between short-term and
long-term bonds. We impose the restriction that $p = q_1 = \dots = q_k$ so that only
$p_{max}$ models ($p = 1, \dots, p_{max}$) need to be estimated. In the example below we
choose $p_{max} = 12$.


# loop 'BIC()' over multiple ADL models
order <- 1:12


BICs <- sapply(order, function(x)
  BIC(dynlm(GDPGrowth_ts ~ L(GDPGrowth_ts, 1:x) + L(TSpread_ts, 1:x),
            start = c(1962, 1), end = c(2012, 4))))


BICs

#>       [,1]   [,2]   [,3]   [,4]    [,5]    [,6]    [,7]    [,8]    [,9]   [,10]
#> p   2.0000 4.0000 6.0000 8.0000 10.0000 12.0000 14.0000 16.0000 18.0000 20.0000
#> BIC 2.3411 2.3408 2.3813 2.4181  2.4568  2.5048  2.5539  2.6029  2.6182  2.6646
#> R2  0.1417 0.1855 0.1950 0.2072  0.2178  0.2211  0.2234  0.2253  0.2581  0.2678
#>       [,11]   [,12]
#> p   22.0000 24.0000
#> BIC  2.7205  2.7664
#> R2   0.2702  0.2803


From the definition of BIC(), for ADL models with p = q it follows that p
reports the number of estimated coefficients excluding the intercept. Thus the
lag order is obtained by dividing p by 2.


# select the ADL model with the smallest BIC
BICs[, which.min(BICs[2, ])]

#>      p    BIC     R2

#> 4.0000 2.3408 0.1855


The BIC is in favor of the ADL(2,2) model (14.5) we have estimated before.


14.7 Nonstationarity I: Trends


If a series is nonstationary, conventional hypothesis tests, confidence intervals
and forecasts can be strongly misleading. The assumption of stationarity is
violated if a series exhibits trends or breaks and the resulting complications in
an econometric analysis depend on the specific type of the nonstationarity. This
section focuses on time series that exhibit trends.


A series is said to exhibit a trend if it has a persistent long-term movement. 
One distinguishes between deterministic and stochastic trends. 


• A trend is deterministic if it is a nonrandom function of time.
• A trend is said to be stochastic if it is a random function of time.


The figures we have produced in Chapter 14.2 reveal that many economic time
series show a trending behavior that is probably best modeled by stochastic
trends. This is why the book focuses on the treatment of stochastic trends.


The Random Walk Model of a Trend 


The simplest way to model a time series $Y_t$ that has a stochastic trend is the
random walk


$$Y_t = Y_{t-1} + u_t, \tag{14.7}$$

where the $u_t$ are i.i.d. errors with $E(u_t \vert Y_{t-1}, Y_{t-2}, \dots) = 0$. Note that


$$\begin{aligned}
E(Y_t \vert Y_{t-1}, Y_{t-2}, \dots) =& E(Y_{t-1} \vert Y_{t-1}, Y_{t-2}, \dots) + E(u_t \vert Y_{t-1}, Y_{t-2}, \dots) \\
=& Y_{t-1}
\end{aligned}$$

so the best forecast for $Y_t$ is yesterday’s observation $Y_{t-1}$. Hence the difference
between $Y_t$ and $Y_{t-1}$ is unpredictable. The path followed by $Y_t$ consists of
random steps $u_t$, hence it is called a random walk.


Assume that $Y_0$, the starting value of the random walk, is 0. Another way to
write (14.7) is

$$\begin{aligned}
Y_0 =& 0 \\
Y_1 =& 0 + u_1 \\
Y_2 =& 0 + u_1 + u_2 \\
\vdots & \\
Y_t =& \sum_{i=1}^t u_i.
\end{aligned}$$

Therefore we have


$$\begin{aligned}
Var(Y_t) =& Var(u_1 + u_2 + \dots + u_t) \\
=& t \sigma_u^2.
\end{aligned}$$

Thus the variance of a random walk depends on $t$ which violates the assumption
presented in Key Concept 14.5: a random walk is nonstationary.
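
The variance result can be checked by simulation. The following sketch generates many independent random walks with standard normal errors (so that the error variance is 1) and compares the sample variance across paths at a few dates with the theoretical value t. The number of paths and the dates are illustrative choices.

```r
# Sketch: Monte Carlo check that Var(Y_t) = t * sigma_u^2 for a random walk.
set.seed(1)

n_paths <- 5000   # number of simulated random walks (illustrative choice)
n_obs <- 100      # length of each walk

# each column is one random walk: cumulative sum of i.i.d. N(0, 1) errors
RW <- replicate(n_paths, cumsum(rnorm(n_obs)))

# sample variance across paths at selected dates; theory says Var(Y_t) = t
dates <- c(10, 50, 100)
v <- apply(RW[dates, ], 1, var)
round(rbind(t = dates, variance = v), 2)
```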


Obviously, (14.7) is a special case of an AR(1) model where $\beta_1 = 1$. One can
show that a time series that follows an AR(1) model is stationary if $|\beta_1| < 1$.
In a general AR(p) model, stationarity is linked to the roots of the polynomial


$$1 - \beta_1 z - \beta_2 z^2 - \beta_3 z^3 - \dots - \beta_p z^p.$$


If all roots are greater than 1 in absolute value, the AR(p) series is stationary.
If at least one root equals 1, the AR(p) is said to have a unit root and thus has
a stochastic trend.
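
The root condition is easy to check numerically with polyroot() from base R. The helper is_stationary() below is a hypothetical name introduced for this sketch; a small tolerance is used because the roots are computed in floating point arithmetic.

```r
# Sketch: check AR(p) stationarity via the roots of 1 - b1 z - ... - bp z^p.
# 'is_stationary' is a hypothetical helper.
is_stationary <- function(beta, tol = 1e-8) {
  # polyroot() expects coefficients in increasing order of the power of z
  roots <- polyroot(c(1, -beta))
  # all roots must lie strictly outside the unit circle
  all(Mod(roots) > 1 + tol)
}

is_stationary(c(0.2, 0.2))   # TRUE: AR(2) with coefficients 0.2 and 0.2
is_stationary(1)             # FALSE: random walk, root z = 1
is_stationary(c(1.2, -0.2))  # FALSE: 1 - 1.2z + 0.2z^2 has the roots 1 and 5
```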


It is straightforward to simulate random walks in R using arima.sim(). The 
function matplot() is convenient for simple plots of the columns of a matrix. 


# simulate and plot random walks starting at 0
set.seed(1)


RWs <- ts(replicate(n = 4,
                    arima.sim(model = list(order = c(0, 1, 0)), n = 100)))


matplot(RWs,
        type = "l",
        col = c("steelblue", "darkgreen", "darkred", "orange"),
        lty = 1,
        lwd = 2,
        main = "Four Random Walks",
        xlab = "Time",
        ylab = "Value")


[Figure: Four Random Walks (x-axis: Time, y-axis: Value)]


Adding a constant to (14.7) yields

$$Y_t = \beta_0 + Y_{t-1} + u_t, \tag{14.8}$$


a random walk model with a drift which allows us to model the tendency of a series
to move upwards or downwards. If $\beta_0$ is positive, the series drifts upwards and
it follows a downward trend if $\beta_0$ is negative.


# simulate and plot random walks with drift 
set.seed(1) 


RWsd <- ts(replicate(n = 4, 
arima.sim(model = list(order = c(0, 1, 0)), 
n = 100, 
mean = -0.2))) 


matplot(RWsd,
        type = "l",
        col = c("steelblue", "darkgreen", "darkred", "orange"),
lty = 1, 
lwd = 2, 
main = "Four Random Walks with Drift", 
xlab = "Time", 
ylab = "Value") 


[Figure: Four Random Walks with Drift (x-axis: Time, y-axis: Value)]


Problems Caused by Stochastic Trends 


OLS estimation of the coefficients on regressors that have a stochastic trend
is problematic because the distributions of the estimator and its t-statistic are
non-normal, even asymptotically. This has various consequences:

• Downward bias of autoregressive coefficients:


If $Y_t$ is a random walk, $\beta_1$ can be consistently estimated by OLS but the
estimator is biased toward zero. Roughly, $E(\widehat{\beta}_1) \approx 1 - 5.3/T$,
a bias which is substantial for sample sizes typically encountered in
macroeconomics. This estimation bias causes forecasts of $Y_t$ to perform worse than
a pure random walk model.


• Non-normally distributed t-statistics:


The non-normal distribution of the estimated coefficient of a stochastic
regressor translates to a non-normal distribution of its t-statistic, so
normal critical values are invalid. Usual confidence intervals and hypothesis
tests are therefore invalid, too, and the true distribution of the
t-statistic cannot be readily determined.


• Spurious Regression:


When two stochastically trending time series are regressed onto each other,
the estimated relationship may appear highly significant using conventional
normal critical values although the series are unrelated. This is
what econometricians call a spurious relationship.
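
The downward bias described in the first item can be made visible with a small Monte Carlo experiment. The sketch below uses base R only and assumes standard normal errors; the sample size and the number of replications are illustrative choices.

```r
# Sketch: Monte Carlo illustration of the downward bias of the OLS estimator
# of beta_1 when Y_t is a random walk. Settings are illustrative.
set.seed(1)

T_obs <- 100    # sample size
reps <- 2000    # number of replications

beta1_hat <- replicate(reps, {
  y <- cumsum(rnorm(T_obs + 1))            # random walk of length T + 1
  # regress Y_t on an intercept and Y_{t-1}
  coef(lm(y[-1] ~ y[-(T_obs + 1)]))[2]
})

# average estimate versus the approximation 1 - 5.3 / T
c(simulated = mean(beta1_hat), approximation = 1 - 5.3 / T_obs)
```

The true coefficient is 1 in every replication, so any systematic shortfall of the average estimate reflects the bias.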


As an example for spurious regression, consider again the green and the red ran- 
dom walks that we have simulated above. We know that there is no relationship 
between both series: they are generated independently of each other. 


# plot spurious relationship
matplot(RWs[, c(2, 3)],
        lty = 1,
        lwd = 2,
        type = "l",
        col = c("darkgreen", "darkred"),
        xlab = "Time",
        ylab = "",
        main = "A Spurious Relationship")
[Figure: A Spurious Relationship (x-axis: Time)]


Imagine we did not have this information and instead conjectured that the
red series is useful for predicting the green series and thus end up estimating
the ADL(0,1) model


$$Green_t = \beta_0 + \beta_1 Red_{t-1} + u_t.$$


# estimate spurious ADL(0,1) model (column 2: green series, column 3: red series)

summary(dynlm(RWs[, 2] ~ L(RWs[, 3])))$coefficients

#>              Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept) -3.459488  0.3635104 -9.516889 1.354156e-15
#> L(RWs[, 3])  1.047195  0.1450874  7.217687 1.135828e-10


The result is obviously spurious: the coefficient on $Red_{t-1}$ is estimated to be
about 1 and the p-value of $1.14 \cdot 10^{-10}$ of the corresponding t-test indicates that
the coefficient is highly significant while its true value is in fact zero.
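
A single pair of series might be a coincidence, so it is instructive to repeat the experiment many times. The sketch below regresses one simulated random walk on another, independent one and records how often a conventional two-sided t-test at the 5% level rejects. It uses lm() instead of dynlm() and a contemporaneous regressor to stay self-contained; the settings are illustrative.

```r
# Sketch: how often does a t-test at the 5% level "find" a relationship
# between two independent random walks? Settings are illustrative.
set.seed(1)

reps <- 1000
n_obs <- 100

reject <- replicate(reps, {
  x <- cumsum(rnorm(n_obs))    # two independent random walks
  y <- cumsum(rnorm(n_obs))
  tstat <- summary(lm(y ~ x))$coefficients[2, 3]
  abs(tstat) > 1.96            # two-sided test with normal critical value
})

mean(reject)   # share of replications with a "significant" slope
```

If the normal approximation were valid, this share would be close to 0.05.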


As an empirical example, consider the U.S. unemployment rate and the Japanese 
industrial production. Both series show an upward trending behavior from the 
mid-1960s through the early 1980s. 


# plot U.S. unemployment rate & Japanese industrial production
plot(merge(as.zoo(USUnemp), as.zoo(JPIndProd)),
     plot.type = "single",
     col = c("darkred", "steelblue"),
     lwd = 2,
     xlab = "Date",
     ylab = "",
     main = "Spurious Regression: Macroeconomic Time series")

# add a legend
legend("topleft",
       legend = c("USUnemp", "JPIndProd"),
       col = c("darkred", "steelblue"),
       lwd = c(2, 2))


[Figure: Spurious Regression: Macroeconomic Time series (x-axis: Date; legend: USUnemp, JPIndProd)]


# estimate regression using data from 1962 to 1985
SR_Unemp1 <- dynlm(ts(USUnemp["1962::1985"]) ~ ts(JPIndProd["1962::1985"]))
coeftest(SR_Unemp1, vcov = sandwich)


#>
#> t test of coefficients:
#>
#>                             Estimate Std. Error t value  Pr(>|t|)
#> (Intercept)                 -2.37452    1.12041 -2.1193    0.0367 *
#> ts(JPIndProd["1962::1985"])  2.22057    0.29233  7.5961 2.227e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


A simple regression of the U.S. unemployment rate on Japanese industrial pro- 
duction using data from 1962 to 1985 yields 


$$\widehat{U.S. UR}_t = -\underset{(1.12)}{2.37} + \underset{(0.29)}{2.22} \log(JapaneseIP_t). \tag{14.9}$$


This appears to be a significant relationship: the t-statistic of the coefficient on
$\log(JapaneseIP_t)$ is bigger than 7.


# Estimate regression using data from 1986 to 2012
SR_Unemp2 <- dynlm(ts(USUnemp["1986::2012"]) ~ ts(JPIndProd["1986::2012"]))
coeftest(SR_Unemp2, vcov = sandwich)


#>
#> t test of coefficients:
#>
#>                             Estimate Std. Error t value  Pr(>|t|)
#> (Intercept)                  41.7763     5.4066  7.7270 6.596e-12 ***
#> ts(JPIndProd["1986::2012"])  -7.7771     1.1714 -6.6391 1.386e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


When estimating the same model, this time with data from 1986 to 2012, we
obtain


$$\widehat{U.S. UR}_t = \underset{(5.41)}{41.78} - \underset{(1.17)}{7.78} \log(JapaneseIP_t) \tag{14.10}$$


which surprisingly is quite different. (14.9) indicates a moderate positive
relationship, in contrast to the large negative coefficient in (14.10). This
phenomenon can be attributed to stochastic trends in the series: since there is no
economic reasoning that relates both trends, both regressions may be spurious.


Testing for a Unit AR Root 


A formal test for a stochastic trend has been proposed by Dickey and Fuller
(1979) which thus is termed the Dickey-Fuller test. As discussed above, a time
series that follows an AR(1) model with $\beta_1 = 1$ has a stochastic trend. Thus,
the testing problem is


$$H_0: \beta_1 = 1 \quad \text{vs.} \quad H_1: |\beta_1| < 1.$$


The null hypothesis is that the AR(1) model has a unit root and the alternative
hypothesis is that it is stationary. One often rewrites the AR(1) model by
subtracting $Y_{t-1}$ on both sides:


$$Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t \quad \Leftrightarrow \quad \Delta Y_t = \beta_0 + \delta Y_{t-1} + u_t \tag{14.11}$$


where $\delta = \beta_1 - 1$. The testing problem then becomes

$$H_0: \delta = 0 \quad \text{vs.} \quad H_1: \delta < 0$$


which is convenient since the corresponding test statistic is reported by many
relevant R functions.¹


The Dickey-Fuller test can also be applied in an AR(p) model. The Augmented 
Dickey-Fuller (ADF) test is summarized in Key Concept 14.8. 





¹ The t-statistic of the Dickey-Fuller test is computed using homoskedasticity-only
standard errors since under the null hypothesis, the usual t-statistic is robust to
conditional heteroskedasticity.
Key Concept 14.8
The ADF Test for a Unit Root


Consider the regression


$$\Delta Y_t = \beta_0 + \delta Y_{t-1} + \gamma_1 \Delta Y_{t-1} + \gamma_2 \Delta Y_{t-2} + \dots + \gamma_p \Delta Y_{t-p} + u_t. \tag{14.12}$$


The ADF test for a unit autoregressive root tests the hypothesis
$H_0: \delta = 0$ (stochastic trend) against the one-sided alternative
$H_1: \delta < 0$ (stationarity) using the usual OLS t-statistic.


If it is assumed that $Y_t$ is stationary around a deterministic linear time
trend, the model is augmented by the regressor $t$:


$$\Delta Y_t = \beta_0 + \alpha t + \delta Y_{t-1} + \gamma_1 \Delta Y_{t-1} + \gamma_2 \Delta Y_{t-2} + \dots + \gamma_p \Delta Y_{t-p} + u_t, \tag{14.13}$$


where again $H_0: \delta = 0$ is tested against $H_1: \delta < 0$.


The optimal lag length $p$ can be estimated using information criteria.
In (14.12), $p = 0$ (no lags of $\Delta Y_t$ are used as regressors) corresponds to
a simple AR(1).


Under the null, the t-statistic corresponding to $H_0: \delta = 0$ does not
have a normal distribution. The critical values can only be obtained
from simulation and differ for regressions (14.12) and (14.13) since the
distribution of the ADF test statistic is sensitive to the deterministic
components included in the regression.
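
Regression (14.12) can be set up by hand with lm(), which makes the alignment of levels and lagged differences explicit. The following is a minimal sketch in base R; adf_stat() is a hypothetical helper, the lag length p is fixed by the user (p ≥ 1) rather than chosen by an information criterion, and the data are simulated.

```r
# Sketch: the ADF regression (14.12) estimated with lm() on a simulated
# random walk. 'adf_stat' is a hypothetical helper; requires p >= 1.
adf_stat <- function(y, p = 2) {
  dy <- diff(y)
  # columns of E: dY_t, dY_{t-1}, ..., dY_{t-p}
  E <- embed(dy, p + 1)
  # Y_{t-1}, aligned with the dY_t in the first column of E
  y_lag <- y[(p + 1):(length(y) - 1)]
  fit <- lm(E[, 1] ~ y_lag + E[, -1, drop = FALSE])
  # t-statistic on Y_{t-1} (homoskedasticity-only standard error)
  summary(fit)$coefficients["y_lag", "t value"]
}

set.seed(1)
rw <- cumsum(rnorm(250))   # series with a unit root
adf_stat(rw, p = 2)        # to be compared with ADF critical values
```

The returned statistic must be compared with Dickey-Fuller critical values, not with quantiles of the normal distribution.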





Critical Values for the ADF Statistic 


Key Concept 14.8 states that the critical values for the ADF test in the regres- 
sions (14.12) and (14.13) can only be determined using simulation. The idea of 
the simulation study is to simulate a large number of ADF test statistics and 
use them to estimate quantiles of their asymptotic distribution. This section 
shows how this can be done using R. 


First, consider the following AR(1) model with intercept

$$Y_t = a + z_t, \quad z_t = \rho z_{t-1} + u_t.$$

This can be written as

$$Y_t = (1 - \rho) a + \rho Y_{t-1} + u_t,$$


i.e., $Y_t$ is a random walk without drift under the null $\rho = 1$. One can show that
$Y_t$ is a stationary process with mean $a$ for $|\rho| < 1$.
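
The rewriting follows from substituting $z_t = Y_t - a$ and $z_{t-1} = Y_{t-1} - a$ into the equation for $z_t$. The short derivation below spells out this step; it uses only the symbols defined above.

```latex
\begin{aligned}
Y_t - a &= z_t = \rho z_{t-1} + u_t = \rho \,(Y_{t-1} - a) + u_t \\
\Rightarrow \quad Y_t &= (1 - \rho)\, a + \rho Y_{t-1} + u_t .
\end{aligned}
```

For $\rho = 1$ the intercept $(1 - \rho) a$ vanishes and the equation reduces to $Y_t = Y_{t-1} + u_t$, the random walk in (14.7).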
The procedure for simulating critical values of a unit root test using the t-ratio
of $\delta$ in (14.11) is as follows:


• Simulate $N$ random walks with $n$ observations using the data generating
process


$$Y_t = a + z_t, \quad z_t = \rho z_{t-1} + u_t,$$


$t = 1, \dots, n$, where $N$ and $n$ are large numbers, $\rho = 1$, $a$ is a constant
and $u$ is a zero mean error term.


• For each random walk, estimate the regression

$$\Delta Y_t = \beta_0 + \delta Y_{t-1} + u_t$$

and compute the ADF test statistic. Save all $N$ test statistics.


• Estimate quantiles of the distribution of the ADF test statistic using the
$N$ test statistics obtained from the simulation.


For the case with drift and linear time trend we replace the data generating
process by


$$Y_t = a + b \cdot t + z_t, \quad z_t = \rho z_{t-1} + u_t \tag{14.14}$$


where $b \cdot t$ is a linear time trend. $Y_t$ in (14.14) is a random walk with (without)
drift if $b \neq 0$ ($b = 0$) under the null of $\rho = 1$ (can you show this?). We estimate
the regression


$$\Delta Y_t = \beta_0 + \alpha \cdot t + \delta Y_{t-1} + u_t.$$


Loosely speaking, the precision of the estimated quantiles depends on two
factors: $n$, the length of the underlying series, and $N$, the number of test
statistics used. Since we are interested in estimating quantiles of the asymptotic
distribution (the Dickey-Fuller distribution) of the ADF test statistic, using both
many observations and a large number of simulated test statistics will increase the
precision of the estimated quantiles. We choose $n = N = 1000$ as the computational
burden grows quickly with $n$ and $N$.


# repetitions 
N <- 1000 


# observations 
n <- 1000 


# define constant, trend and rho
drift <- 0.5
trend <- 1:n
rho <- 1


# function which simulates an AR(1) process
AR1 <- function(rho) {
  out <- numeric(n)
  for(i in 2:n) {
    out[i] <- rho * out[i-1] + rnorm(1)
  }
  return(out)
}


# simulate from DGP with constant
RWD <- ts(replicate(n = N, drift + AR1(rho)))


# compute ADF test statistics and store them in 'ADFD'
ADFD <- numeric(N)


for(i in 1:ncol(RWD)) {
  ADFD[i] <- summary(
    dynlm(diff(RWD[, i], 1) ~ L(RWD[, i], 1)))$coef[2, 3]
}


# simulate from DGP with constant and trend
RWDT <- ts(replicate(n = N, drift + trend + AR1(rho)))


# compute ADF test statistics and store them in 'ADFDT'
ADFDT <- numeric(N)


for(i in 1:ncol(RWDT)) {
  ADFDT[i] <- summary(
    dynlm(diff(RWDT[, i], 1) ~ L(RWDT[, i], 1) + trend(RWDT[, i]))
  )$coef[2, 3]
}


# estimate quantiles for ADF regression with a drift 
round(quantile(ADFD, c(0.1, 0.05, 0.01)), 2) 

#> 10% 5% 1% 

#> -2.62 -2.83 -3.39 


# estimate quantiles for ADF regression with drift and trend 
round(quantile(ADFDT, c(0.1, 0.05, 0.01)), 2) 

#> 10% 5% 1% 

#> -3.11 -3.43 -3.97
The estimated quantiles are close to the large-sample critical values of the ADF
test statistic reported in Table 14.4 of the book.


Table 14.2: Large Sample Critical Values of ADF Test


Deterministic Regressors      10%      5%      1%

Intercept only              -2.57   -2.86   -3.43
Intercept and time trend    -3.12   -3.41   -3.96





The results show that using standard normal critical values is erroneous: the
5% critical value of the standard normal distribution is -1.64. For the
Dickey-Fuller distributions the estimated critical values are -2.83 (drift) and -3.43
(drift and linear time trend). This implies that a true null (the series has a
stochastic trend) would be rejected far too often if inappropriate normal critical
values were used.
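
How severe the over-rejection is can be estimated directly. The sketch below simulates Dickey-Fuller statistics for the regression with intercept under a true unit root null, using lm() in place of dynlm() to remain self-contained, and compares rejection frequencies for the normal critical value and the large-sample Dickey-Fuller critical value from Table 14.2. N and n are smaller than in the text to keep the run time short.

```r
# Sketch: size of a Dickey-Fuller test that wrongly uses the normal 5%
# critical value. N and n are illustrative and smaller than in the text.
set.seed(1)

N <- 500    # number of simulated random walks
n <- 250    # observations per series

df_stats <- replicate(N, {
  y <- cumsum(rnorm(n))
  dy <- diff(y)
  y_lag <- y[-n]
  # Dickey-Fuller regression with intercept: dY_t = b0 + delta * Y_{t-1} + u_t
  summary(lm(dy ~ y_lag))$coefficients[2, 3]
})

# rejection frequencies under a true null hypothesis
c(normal_cv = mean(df_stats < -1.64),
  df_cv     = mean(df_stats < -2.86))
```

A correctly sized 5% test rejects a true null in about 5% of the replications.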


We may use the simulated test statistics for a graphical comparison of the 
standard normal density and (estimates of) both Dickey-Fuller densities. 


# plot standard normal density
curve(dnorm(x),
      from = -6, to = 3,
      ylim = c(0, 0.6),
      lty = 2,
      ylab = "Density",
      xlab = "t-Statistic",
      main = "Distributions of ADF Test Statistics",
      col = "darkred",
      lwd = 2)


# plot density estimates of both Dickey-Fuller distributions
lines(density(ADFD), lwd = 2, col = "darkgreen")
lines(density(ADFDT), lwd = 2, col = "blue")


# add a legend
legend("topleft",
       c("N(0,1)", "Drift", "Drift+Trend"),
       col = c("darkred", "darkgreen", "blue"),
       lty = c(2, 1, 1),
       lwd = 2)
[Figure: Distributions of ADF Test Statistics]


The deviations from the standard normal distribution are significant: both
Dickey-Fuller distributions are skewed to the left and have a heavier left tail
than the standard normal distribution.


Does U.S. GDP Have a Unit Root? 


As an empirical example, we use the ADF test to assess whether there is a 
stochastic trend in U.S. GDP using the regression 


$$\Delta \log(GDP_t) = \beta_0 + \alpha t + \beta_1 \log(GDP_{t-1}) + \beta_2 \Delta \log(GDP_{t-1}) + \beta_3 \Delta \log(GDP_{t-2}) + u_t.$$


# generate log GDP series
LogGDP <- ts(log(GDP["1962::2012"]))


# estimate the model
coeftest(
  dynlm(diff(LogGDP) ~ trend(LogGDP, scale = F) + L(LogGDP)
        + diff(L(LogGDP)) + diff(L(LogGDP), 2)))


#> 

#> t test of coefficients: 

#> 

#>                             Estimate  Std. Error t value Pr(>|t|)
#> (Intercept)               0.27877045  0.11793233  2.3638 0.019066 *
#> trend(LogGDP, scale = F)  0.00023818  0.00011090  2.1476 0.032970 *
#> L(LogGDP)                -0.03332452  0.01441436 -2.3119 0.021822 *
#> diff(L(LogGDP))           0.08317976  0.11295542  0.7364 0.462371
#> diff(L(LogGDP), 2)        0.18763384  0.07055574  2.6594 0.008476 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimation yields


$$\begin{aligned}
\Delta \log(GDP_t) =& \underset{(0.118)}{0.28} + \underset{(0.0001)}{0.0002} t - \underset{(0.014)}{0.033} \log(GDP_{t-1}) \\
&+ \underset{(0.113)}{0.083} \Delta \log(GDP_{t-1}) + \underset{(0.071)}{0.188} \Delta \log(GDP_{t-2}) + u_t,
\end{aligned}$$


so the ADF test statistic is $t = -0.0333/0.0144 \approx -2.31$. The corresponding 5%
critical value from Table 14.2 is $-3.41$ so we cannot reject the null hypothesis
that log(GDP) has a stochastic trend in favor of the alternative that it is
stationary around a deterministic linear time trend.
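
The test decision can be reproduced from the reported estimate and standard error. The numbers in the following sketch are copied from the coeftest() output above; the variable names are illustrative.

```r
# Sketch: the ADF statistic from the reported estimate and standard error
est <- -0.03332452   # coefficient on L(LogGDP) in the output above
se <- 0.01441436     # its standard error

t_adf <- est / se
round(t_adf, 4)

# one-sided test against the 5% critical value for intercept and time trend
t_adf < -3.41        # FALSE: the unit root null is not rejected
```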


The ADF test can be done conveniently using ur.df() from the package urca.


# test for unit root in GDP using 'ur.df()' from the package 'urca'
summary(ur.df(LogGDP,
              type = "trend",
              lags = 2,
              selectlags = "Fixed"))
#>
#> ###############################################
#> # Augmented Dickey-Fuller Test Unit Root Test #
#> ###############################################
#>
#> Test regression trend
#>
#>
#> Call:
#> lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
#>
#> Residuals:
#>       Min        1Q    Median        3Q       Max
#> -0.025580 -0.004109  0.000321  0.004869  0.032781
#>
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  0.2790086  0.1180427   2.364 0.019076 *
#> z.lag.1     -0.0333245  0.0144144  -2.312 0.021822 *
#> tt           0.0002382  0.0001109   2.148 0.032970 *
#> z.diff.lag1  0.2708136  0.0697696   3.882 0.000142 ***
#> z.diff.lag2  0.1876338  0.0705557   2.659 0.008476 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.007704 on 196 degrees of freedom
#> Multiple R-squared:  0.1783, Adjusted R-squared:  0.1616
#> F-statistic: 10.63 on 4 and 196 DF,  p-value: 8.076e-08
#>
#>
#> Value of test-statistic is: -2.3119 11.2558 4.267
#>
#> Critical values for test statistics:
#>       1pct  5pct 10pct
#> tau3 -3.99 -3.43 -3.13
#> phi2  6.22  4.75  4.07
#> phi3  8.43  6.49  5.47

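The first test statistic at the bottom of the output is the one we are interested
in. The number of test statistics reported depends on the test regression. For
type = "trend", the second statistic (phi2) corresponds to the joint test of the
hypothesis that there is a unit root, no time trend and no drift term, while the
third one (phi3) corresponds to the joint test that there is a unit root and no
time trend. Note that ur.df() reports 0.271 for the first lagged difference whereas
the dynlm() regression above reports 0.083: diff(L(LogGDP), 2) is the two-period
difference $\log(GDP_{t-1}) - \log(GDP_{t-3})$ rather than $\Delta \log(GDP_{t-2})$. The two
regressions are reparametrizations of each other, so the ADF statistic is identical.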


14.8 Nonstationarity II: Breaks 


When there are discrete (at a distinct date) or gradual (over time) changes in 
the population regression coefficients, the series is nonstationary. These changes 
are called breaks. There is a variety of reasons why breaks can occur in macroe- 
conomic time series but most often they are related to changes in economic 
policy or major changes in the structure of the economy. See Chapter 14.7 of 
the book for some examples. 


If breaks are not accounted for in the regression model, OLS estimates will reflect 
the average relationship. Since these estimates might be strongly misleading 
and result in poor forecast quality, we are interested in testing for breaks. One 
distinguishes between testing for a break when the date is known and testing 
for a break with an unknown break date. 


Let $\tau$ denote a known break date and let $D_t(\tau)$ be a binary variable indicating
time periods before and after the break. Incorporating the break in an ADL(1,1)
regression model yields


$$\begin{aligned}
Y_t =& \beta_0 + \beta_1 Y_{t-1} + \delta_1 X_{t-1} + \gamma_0 D_t(\tau) + \gamma_1 \left[D_t(\tau) \cdot Y_{t-1}\right] \\
&+ \gamma_2 \left[D_t(\tau) \cdot X_{t-1}\right] + u_t,
\end{aligned}$$


where we allow for discrete changes in $\beta_0$, $\beta_1$ and $\delta_1$ at the break date $\tau$. The
null hypothesis of no break,


$$H_0: \gamma_0 = \gamma_1 = \gamma_2 = 0,$$


can be tested against the alternative that at least one of the $\gamma$'s is not zero using
an F-test. This idea is called a Chow test after Gregory Chow (1960).
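
The mechanics of the Chow test can be illustrated on simulated data. The sketch below is deliberately simpler than the ADL(1,1) model above: it uses a static regression with a single regressor, a break in the slope at a known date, and the homoskedasticity-only F-statistic from anova(). All names and parameter values are illustrative.

```r
# Sketch: Chow test for a known break date on simulated data.
# All names and parameter values are illustrative.
set.seed(1)

n <- 200
tau <- 100                          # known break date
x <- rnorm(n)
D <- as.numeric(seq_len(n) > tau)   # break dummy: 1 after the break

# the slope on x changes from 1 to 2 after the break
y <- 0.5 + 1 * x + 1 * D * x + rnorm(n)

restricted <- lm(y ~ x)               # no break
unrestricted <- lm(y ~ x + D + D:x)   # intercept and slope may shift

# F-test of H0: the coefficients on D and D:x are both zero
chow <- anova(restricted, unrestricted)
chow$F[2]
chow$`Pr(>F)`[2]
```

For heteroskedasticity-robust inference, as used elsewhere in this chapter, the F-statistic would have to be computed with a robust covariance matrix estimator instead.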
When the break date is unknown the Quandt likelihood ratio (QLR) test
(Quandt, 1960) may be used. It is a modified version of the Chow test which
uses the largest of all F-statistics obtained when applying the Chow test for
all possible break dates in a predetermined range $[\tau_0, \tau_1]$. The QLR test is
summarized in Key Concept 14.9.


Key Concept 14.9 
The QLR Test for Coefficient Stability 


The QLR test can be used to test for a break in the population regression
function if the date of the break is unknown. The QLR test statistic is
the largest (Chow) $F(\tau)$ statistic computed over a range of eligible break
dates $\tau_0 \leq \tau \leq \tau_1$:


$$QLR = \max\left[F(\tau_0), F(\tau_0 + 1), \dots, F(\tau_1)\right]. \tag{14.15}$$


The most important properties are:


• The QLR test can be applied to test whether a subset of the coefficients
in the population regression function breaks but the test
also rejects if there is a slow evolution of the regression function.
When there is a single discrete break in the population regression
function that lies at a date within the range tested, the QLR test
statistic is $F(\widehat{\tau})$ and $\widehat{\tau}/T$ is a consistent estimator of the
fraction of the sample at which the break is.


• The large-sample distribution of QLR depends on $q$, the number
of restrictions being tested and both ratios of end points to the
sample size, $\tau_0/T$ and $\tau_1/T$.


• Similar to the ADF test, the large-sample distribution of QLR is
nonstandard. Critical values are presented in Table 14.5 of the
book.
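
The definition in (14.15) translates into a loop over candidate break dates. The following sketch computes the QLR statistic for simulated data with a single break in the slope of a static regression, trimming 15% of the sample at each end. It uses homoskedasticity-only F-statistics and illustrative names and values throughout.

```r
# Sketch: QLR statistic as the maximum Chow F-statistic over the central
# 70% of a simulated sample. Names and values are illustrative.
set.seed(1)

n <- 200
true_break <- 120
x <- rnorm(n)
y <- 0.5 + x + 1.5 * (seq_len(n) > true_break) * x + rnorm(n)

# candidate break dates: trim 15% of the sample at each end
taus <- seq(ceiling(0.15 * n), floor(0.85 * n))

# Chow F-statistic for each candidate break date
Fstat <- sapply(taus, function(tau) {
  D <- as.numeric(seq_len(n) > tau)
  anova(lm(y ~ x), lm(y ~ x + D + D:x))$F[2]
})

QLR <- max(Fstat)
tau_hat <- taus[which.max(Fstat)]
c(QLR = QLR, break_date = tau_hat, fraction = tau_hat / n)
```

The date at which the maximum is attained serves as the estimate of the break date, in line with the first property listed above.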





Has the Predictive Power of the Term Spread Been Stable?


Using the QLR statistic we may test whether there is a break in the coefficients
on the lags of the term spread in (14.5), the ADL(2,2) regression model of GDP
growth. Following Key Concept 14.9 we modify the specification of (14.5) by
adding a break dummy $D(\tau)$ and its interactions with both lags of term spread
and choose the range of break points to be tested as 1970:Q1 - 2005:Q2 (these
periods are the center 70% of the sample data from 1962:Q2 - 2012:Q4). Thus,
the model becomes


$$\begin{aligned}
GDPGR_t =& \beta_0 + \beta_1 GDPGR_{t-1} + \beta_2 GDPGR_{t-2} \\
&+ \beta_3 TSpread_{t-1} + \beta_4 TSpread_{t-2} \\
&+ \gamma_1 D(\tau) + \gamma_2 (D(\tau) \cdot TSpread_{t-1}) \\
&+ \gamma_3 (D(\tau) \cdot TSpread_{t-2}) \\
&+ u_t.
\end{aligned}$$


Next, we estimate the model for each break point and compute the F-statistic
corresponding to the null hypothesis $H_0: \gamma_1 = \gamma_2 = \gamma_3 = 0$. The QLR-statistic
is the largest of the F-statistics obtained in this manner.


# set up a range of possible break dates
tau <- seq(1970, 2005, 0.25)


# initialize vector of F-statistics
Fstats <- numeric(length(tau))


# estimation loop over break dates
for(i in 1:length(tau)) {

  # set up dummy variable
  D <- time(GDPGrowth_ts) > tau[i]

  # estimate ADL(2,2) model with interactions
  test <- dynlm(GDPGrowth_ts ~ L(GDPGrowth_ts) + L(GDPGrowth_ts, 2) +
                D*L(TSpread_ts) + D*L(TSpread_ts, 2),
                start = c(1962, 1),
                end = c(2012, 4))

  # compute and save the F-statistic
  Fstats[i] <- linearHypothesis(test,
                                c("DTRUE=0", "DTRUE:L(TSpread_ts)",
                                  "DTRUE:L(TSpread_ts, 2)"),
                                vcov. = sandwich)$F[2]

}


We determine the QLR statistic using max(). 


# identify QLR statistic 
QLR <- max(Fstats) 

QLR 

#> [1] 6.651156 




Let us check that the QLR-statistic is the F-statistic obtained for the regression
where 1980:Q4 is chosen as the break date.


# identify the time period where the QLR-statistic is observed 
as.yearqtr (tau[which.max(Fstats)]) 
#> [1] "1980 Q4" 


Since q = 3 hypotheses are tested and the central 70% of the sample are con- 
sidered to contain breaks, the corresponding 1% critical value of the QLR test 
is 6.02. We reject the null hypothesis that all coefficients (the coefficients on 
both lags of term spread and the intercept) are stable since the computed QLR- 
statistic exceeds this threshold. Thus evidence from the QLR test suggests that 
there is a break in the ADL(2,2) model of GDP growth in the early 1980s. 


To reproduce Figure 14.5 of the book, we convert the vector of sequential break- 
point F-statistics into a time series object and then generate a simple plot with 
some annotations. 


# series of F-statistics
Fstatsseries <- ts(Fstats,
                   start = tau[1],
                   end = tau[length(tau)],
                   frequency = 4)


# plot the F-statistics
plot(Fstatsseries,
     xlim = c(1960, 2015),
     ylim = c(1, 7.5),
     lwd = 2,
     col = "steelblue",
     ylab = "F-Statistic",
     xlab = "Break Date",
     main = "Testing for a Break in GDP ADL(2,2) Regression at Different Dates")


# dashed horizontal lines for critical values and QLR statistic
abline(h = 4.71, lty = 2)
abline(h = 6.02, lty = 2)
segments(0, QLR, 1980.75, QLR, col = "darkred")
text(2010, 6.2, "1% Critical Value")
text(2010, 4.9, "5% Critical Value")
text(1980.75, QLR+0.2, "QLR Statistic")


Pseudo Out-of-Sample Forecasting


Pseudo out-of-sample forecasts are used to simulate the out-of-sample
performance (the real time forecast performance) of a time series regression model. In
particular, pseudo out-of-sample forecasts allow estimation of the RMSFE of
the model and enable researchers to compare different model specifications with
respect to their predictive power. Key Concept 14.10 summarizes this idea.


Key Concept 14.10 
Pseudo Out-of-Sample Forecasting 


1. Divide the sample data into $s = T - P$ and $P$ subsequent observations.
   The $P$ observations are used as pseudo-out-of-sample observations.

2. Estimate the model using the first $s$ observations.

3. Compute the pseudo-forecast $\widetilde{Y}_{s+1 \vert s}$.

4. Compute the pseudo-forecast-error $\widetilde{u}_{s+1} = Y_{s+1} - \widetilde{Y}_{s+1 \vert s}$.

5. Repeat steps 2 through 4 for all remaining pseudo-out-of-sample
   dates, i.e., reestimate the model at each date.
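
The steps above can be sketched for a simple AR(1) model on simulated data. The example uses base R only; the AR coefficient, the sample size and the number of pseudo-out-of-sample periods are illustrative assumptions, and the RMSFE is estimated as the root of the mean squared pseudo-forecast error.

```r
# Sketch: pseudo out-of-sample forecasts for an AR(1) model on simulated
# data, following the steps of Key Concept 14.10. Values are illustrative.
set.seed(1)

n_total <- 200    # T
P <- 40           # number of pseudo-out-of-sample periods
y <- as.numeric(arima.sim(model = list(ar = 0.5), n = n_total))

errors <- numeric(P)

for (j in 1:P) {
  s <- n_total - P + j - 1    # last observation used for estimation
  # step 2: estimate the AR(1) by OLS using observations 1, ..., s
  fit <- lm(y[2:s] ~ y[1:(s - 1)])
  # step 3: one-step-ahead pseudo-forecast of Y_{s+1}
  y_hat <- sum(coef(fit) * c(1, y[s]))
  # step 4: pseudo-forecast error
  errors[j] <- y[s + 1] - y_hat
}

# root mean squared forecast error estimated from the P errors
RMSFE <- sqrt(mean(errors^2))
RMSFE
```

Since the simulated errors have unit variance, an RMSFE in the vicinity of 1 indicates that the model forecasts about as well as can be expected.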





Did the Predictive Power of the Term Spread Change During the 
2000s? 


The insight gained in the previous section gives reason to presume that the
pseudo-out-of-sample performance of ADL(2,2) models estimated using data
after the break in the early 1980s should not deteriorate relative to using the
whole sample: provided that the coefficients of the population regression
function are stable after the potential break in 1980:Q4, these models should have
good predictive power. We check this by computing pseudo-out-of-sample
forecasts for the period 2003:Q1 - 2012:Q4, a range covering 40 periods, where the
forecast for 2003:Q1 is done using data from 1981:Q1 - 2002:Q4, the forecast for
2003:Q2 is based on data from 1981:Q1 - 2003:Q1 and so on.


Similarly as for the QLR-test we use a for() loop for estimation of all 40 models
and gather their SERs and the obtained forecasts in a vector which is then used
to compute pseudo-out-of-sample forecast errors.


# end of sample dates
EndOfSample <- seq(2002.75, 2012.5, 0.25)

# initialize matrix forecasts
forecasts <- matrix(nrow = 1,
                    ncol = length(EndOfSample))

# initialize vector SER
SER <- numeric(length(EndOfSample))

# estimation loop over end of sample dates
for(i in 1:length(EndOfSample)) {

  # estimate ADL(2,2) model
  m <- dynlm(GDPGrowth_ts ~ L(GDPGrowth_ts) + L(GDPGrowth_ts, 2)
                          + L(TSpread_ts) + L(TSpread_ts, 2),
             start = c(1981, 1),
             end = EndOfSample[i])

  # store the standard error of the regression
  SER[i] <- summary(m)$sigma

  # sample data for one-period ahead forecast
  s <- window(ADLdata, EndOfSample[i] - 0.25, EndOfSample[i])

  # compute forecast
  forecasts[i] <- coef(m) %*% c(1, s[1, 1], s[2, 1], s[1, 2], s[2, 2])
}

# compute pseudo-out-of-sample forecast errors
POOSFCE <- c(window(GDPGrowth_ts, c(2003, 1), c(2012, 4))) - forecasts


We next translate the pseudo-out-of-sample forecasts into an object of class ts
and plot the real GDP growth rate against the forecasted series.


# series of pseudo-out-of-sample forecasts
PSOSSFc <- ts(c(forecasts),
              start = 2003,
              end = 2012.75,
              frequency = 4)

# plot the GDP growth time series
plot(window(GDPGrowth_ts, c(2003, 1), c(2012, 4)),
     col = "steelblue",
     lwd = 2,
     ylab = "Percent",
     main = "Pseudo-Out-Of-Sample Forecasts of GDP Growth")

# add the series of pseudo-out-of-sample forecasts
lines(PSOSSFc,
      lwd = 2,
      lty = 2)

# shade area between curves (the pseudo forecast error)
# alpha() is provided by the package scales
polygon(c(time(PSOSSFc), rev(time(PSOSSFc))),
        c(window(GDPGrowth_ts, c(2003, 1), c(2012, 4)), rev(PSOSSFc)),
        col = alpha("blue", alpha = 0.3),
        border = NA)

# add a legend
legend("bottomleft",
       lty = c(1, 2, 1),
       lwd = c(2, 2, 10),
       col = c("steelblue", "black", alpha("blue", alpha = 0.3)),
       legend = c("Actual GDP growth rate",
                  "Forecasted GDP growth rate",
                  "Pseudo forecast Error"))


[Figure: Pseudo-Out-Of-Sample Forecasts of GDP Growth]


Apparently, the pseudo forecasts track the actual GDP growth rate quite well,
except for the kink in 2009 which can be attributed to the recent financial
crisis.


The SER of the first model (estimated using data from 1981:Q1 to 2002:Q4) is
2.39, so based on the in-sample fit we would expect the out-of-sample forecast
errors to have mean zero and a root mean squared forecast error of about 2.39.


# SER of ADL(2,2) model using data from 1981:Q1 - 2002:Q4
SER[1] 
#> [1] 2.389773 


The root mean squared forecast error of the pseudo-out-of-sample forecasts is 
somewhat larger. 


# compute root mean squared forecast error
# (sd() centers the forecast errors at their sample mean)
sd(POOSFCE)
#> [1] 2.667612 
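

Note that sd() measures the dispersion of the forecast errors around their
sample mean, whereas a root mean squared error is taken around zero. The
following sketch shows how the two measures are related. It uses simulated
errors as a stand-in for POOSFCE, and the names e, sd_e, rmsfe and rmsfe_check
are hypothetical.

```r
# Illustrative sketch: sd() of forecast errors versus the RMSFE around zero.
# 'e' is a simulated stand-in for a vector of forecast errors (an assumption).
set.seed(42)
e <- rnorm(40, mean = -0.5, sd = 2.5)
n <- length(e)

# dispersion around the sample mean (what sd() reports)
sd_e <- sd(e)

# root mean squared forecast error around zero
rmsfe <- sqrt(mean(e^2))

# decomposition: mean squared error = squared mean error + (n - 1)/n * variance
rmsfe_check <- sqrt(mean(e)^2 + (n - 1) / n * var(e))

c(sd = sd_e, RMSFE = rmsfe, check = rmsfe_check)
```

When the mean forecast error is close to zero and the number of forecasts is
moderately large, both measures are nearly identical.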


An interesting hypothesis is whether the mean forecast error is zero, that is,
whether the ADL(2,2) forecasts are right on average. This hypothesis is easily
tested using the function t.test().


# test if mean forecast error is zero
t.test(POOSFCE)
#> 
#>  One Sample t-test
#> 
#> data:  POOSFCE
#> t = -1.5523, df = 39, p-value = 0.1287
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>  -1.5078876  0.1984001
#> sample estimates:
#>  mean of x 
#> -0.6547438


The hypothesis cannot be rejected at the 10% significance level. Altogether the 
analysis suggests that the ADL(2,2) model coefficients have been stable since 
the presumed break in the early 1980s. 


14.9 Can You Beat the Market? (Part II)


The dividend yield (the ratio of current dividends to the stock price) can be
considered as an indicator of future dividends: if a stock has a high current
dividend yield, it can be considered undervalued and it can be presumed that
the price of the stock goes up in the future, meaning that future excess returns
go up.


This presumption can be examined using ADL models of excess returns, where 
lags of the logarithm of the stock’s dividend yield serve as additional regressors. 


Unfortunately, a graphical inspection of the time series of the logarithm of the 
dividend yield casts doubt on the assumption that the series is stationary which, 
as has been discussed in Chapter 14.7, is necessary to conduct standard inference 
in a regression analysis. 


# plot logarithm of dividend yield series
plot(StockReturns[, 2],
     col = "steelblue",
     lwd = 2,
     ylab = "Logarithm",
     main = "Dividend Yield for CRSP Index")


[Figure: Dividend Yield for CRSP Index]


The Dickey-Fuller test statistic for an autoregressive unit root in an AR(1)
model with drift provides further evidence that the series might be
nonstationary.


# test for unit root in the log dividend yield using 'ur.df()' from the package 'urca'
summary(ur.df(window(StockReturns[, 2],
                     c(1960, 1),
                     c(2002, 12)),
              type = "drift",
              lags = 0))
#> 
#> ############################################### 
#> # Augmented Dickey-Fuller Test Unit Root Test # 
#> ############################################### 
#> 
#> Test regression drift 
#> 
#> 
#> Call:
#> lm(formula = z.diff ~ z.lag.1 + 1)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -14.3540  -2.9118  -0.2952   2.6374  25.5170 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.740964   2.080039  -1.318    0.188
#> z.lag.1     -0.007652   0.005989  -1.278    0.202
#> 
#> Residual standard error: 4.45 on 513 degrees of freedom
#> Multiple R-squared:  0.003172, Adjusted R-squared:  0.001229 
#> F-statistic: 1.633 on 1 and 513 DF,  p-value: 0.2019
#> 
#> 
#> Value of test-statistic is: -1.2777 0.9339 
#> 
#> Critical values for test statistics: 
#>       1pct  5pct 10pct
#> tau2 -3.43 -2.86 -2.57
#> phi1  6.43  4.59  3.78


We use window() to get observations from January 1960 to December 2002 only.


Since the t-value for the coefficient on the lagged logarithm of the dividend
yield is -1.28 and thus larger than the 10% critical value of -2.57, the
hypothesis that the true coefficient is zero cannot be rejected, even
at the 10% significance level.
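

The Dickey-Fuller regression with drift reported by ur.df() can also be
sketched with lm() alone. The example below uses a simulated random walk rather
than the dividend yield series, and the names z, z_diff, z_lag1 and df_stat are
hypothetical. The 10% critical value of -2.57 is the one shown in the output
above.

```r
# Illustrative sketch of the Dickey-Fuller regression with drift using lm().
# A simulated random walk stands in for the series under study (an assumption).
set.seed(1)
z <- cumsum(rnorm(516))

# regress the first difference on a constant and the lagged level
z_diff <- diff(z)
z_lag1 <- z[-length(z)]
df_reg <- lm(z_diff ~ z_lag1)

# the Dickey-Fuller statistic is the t-statistic on the lagged level
df_stat <- coef(summary(df_reg))["z_lag1", "t value"]
df_stat

# compare with the 10% Dickey-Fuller critical value for the drift case;
# critical values from the normal or t distribution are not valid here
df_stat < -2.57
```

The null hypothesis of a unit root is rejected only if the statistic falls below
the (negative) critical value.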


However, it is possible to examine whether the dividend yield has predictive 
power for excess returns by using its differences in an ADL(1,1) and an ADL(2,2) 
model (remember that differencing a series with a unit root yields a stationary 
series), although these model specifications do not correspond to the economic 
reasoning mentioned above. Thus, we also estimate an ADL(1,1) regression 
using the level of the logarithm of the dividend yield. 


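That is, we estimate three different specifications:


$$
excess \, returns_t = \beta_0 + \beta_1 excess \, returns_{t-1}
  + \beta_3 \Delta \log(dividend \, yield_{t-1}) + u_t
$$

$$
\begin{aligned}
excess \, returns_t = & \, \beta_0 + \beta_1 excess \, returns_{t-1}
  + \beta_2 excess \, returns_{t-2} \\
  & + \beta_3 \Delta \log(dividend \, yield_{t-1})
  + \beta_4 \Delta \log(dividend \, yield_{t-2}) + u_t
\end{aligned}
$$

$$
excess \, returns_t = \beta_0 + \beta_1 excess \, returns_{t-1}
  + \beta_5 \log(dividend \, yield_{t-1}) + u_t
$$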


# ADL(1,1) (1st difference of log dividend yield)
CRSP_ADL_1 <- dynlm(ExReturn ~ L(ExReturn) + d(L(ln_DivYield)),
                    data = StockReturns,
                    start = c(1960, 1), end = c(2002, 12))

# ADL(2,2) (1st differences of the 1st and 2nd lag of log dividend yield)
CRSP_ADL_2 <- dynlm(ExReturn ~ L(ExReturn) + L(ExReturn, 2)
                    + d(L(ln_DivYield)) + d(L(ln_DivYield, 2)),
                    data = StockReturns,
                    start = c(1960, 1), end = c(2002, 12))

# ADL(1,1) (level of log dividend yield)
CRSP_ADL_3 <- dynlm(ExReturn ~ L(ExReturn) + L(ln_DivYield),
                    data = StockReturns,
                    start = c(1960, 1), end = c(1992, 12))

# gather robust standard errors
rob_se_CRSP_ADL <- list(sqrt(diag(sandwich(CRSP_ADL_1))),
                        sqrt(diag(sandwich(CRSP_ADL_2))),
                        sqrt(diag(sandwich(CRSP_ADL_3))))


A tabular representation of the results can then be generated using 
stargazer(). 


stargazer(CRSP_ADL_1, CRSP_ADL_2, CRSP_ADL_3,
  title = "ADL Models of Monthly Excess Stock Returns",
  header = FALSE,
  type = "latex",
  column.sep.width = "-5pt",
  no.space = T,
  digits = 3,
  column.labels = c("ADL(1,1)", "ADL(2,2)", "ADL(1,1)"),
  dep.var.caption = "Dependent Variable: Excess returns on the CRSP value-weighted index",
  dep.var.labels.include = FALSE,
  covariate.labels = c("$excess return_{t-1}$",
                       "$excess return_{t-2}$",
                       "$1^{st} diff log(dividend yield_{t-1})$",
                       "$1^{st} diff log(dividend yield_{t-2})$",
                       "$log(dividend yield_{t-1})$",
                       "Constant"),
  se = rob_se_CRSP_ADL)


For models (1) and (2) none of the individual t-statistics suggests that the
coefficients are different from zero. Also, we cannot reject the hypothesis that
none of the lags have predictive power for excess returns at any common level
of significance (an F-test of the null hypothesis that the lags have no
predictive power does not reject for either model).


Things are different for model (3). The coefficient on the level of the logarithm
of the dividend yield is different from zero at the 5% level and the F-test rejects,
too. But we should be suspicious: the high degree of persistence in the dividend
yield series probably renders this inference dubious because t- and F-statistics
may follow distributions that deviate considerably from their theoretical
large-sample distributions such that the usual critical values cannot be applied.


Table 14.3: ADL Models of Monthly Excess Stock Returns

Dependent Variable: Excess returns on the CRSP Value-Weighted Index

                                     ADL(1,1)             ADL(2,2)             ADL(1,1)
                                       (1)                  (2)                  (3)
excess return_{t-1}                   0.059                0.042                0.078
                                     (0.158)              (0.162)              (0.057)
excess return_{t-2}                                       -0.213
                                                          (0.193)
1st diff log(dividend yield_{t-1})    0.009               -0.012
                                     (0.157)              (0.163)
1st diff log(dividend yield_{t-2})                        -0.161
                                                          (0.185)
log(dividend yield_{t-1})                                                       0.026**
                                                                               (0.012)
Constant                              0.309                0.372*               8.987**
                                     (0.199)              (0.208)              (3.912)
Observations                           516                  516                  396
R^2                                   0.003                0.007                0.018
Adjusted R^2                         -0.001               -0.001                0.013
Residual Std. Error              4.338 (df = 513)     4.337 (df = 511)     4.407 (df = 393)
F Statistic                     0.653 (df = 2; 513)  0.897 (df = 4; 511)  3.683** (df = 2; 393)

Note: *p<0.1; **p<0.05; ***p<0.01


If model (3) were of use for predicting excess returns, pseudo-out-of-sample 
forecasts based on (3) should at least outperform forecasts of an intercept-only 
model in terms of the sample RMSFE. We can perform this type of comparison 
using R code in the fashion of the applications of Chapter 14.8. 


# end of sample dates
EndOfSample <- as.numeric(window(time(StockReturns), c(1992, 12), c(2002, 11)))

# initialize matrix forecasts
forecasts <- matrix(nrow = 2,
                    ncol = length(EndOfSample))

# estimation loop over end of sample dates
for(i in 1:length(EndOfSample)) {

  # estimate model (3)
  mod3 <- dynlm(ExReturn ~ L(ExReturn) + L(ln_DivYield), data = StockReturns,
                start = c(1960, 1),
                end = EndOfSample[i])

  # estimate intercept only model
  modconst <- dynlm(ExReturn ~ 1, data = StockReturns,
                    start = c(1960, 1),
                    end = EndOfSample[i])

  # sample data for one-period ahead forecast
  t <- window(StockReturns, EndOfSample[i], EndOfSample[i])

  # compute forecast
  forecasts[, i] <- c(coef(mod3) %*% c(1, t[1], t[2]), coef(modconst))
}

# gather data
d <- cbind("Excess Returns" = c(window(StockReturns[, 1], c(1993, 1), c(2002, 12))),
           "Model (3)" = forecasts[1, ],
           "Intercept Only" = forecasts[2, ],
           "Always Zero" = 0)

# compute RMSFEs
c("ADL model (3)" = sd(d[, 1] - d[, 2]),
  "Intercept-only model" = sd(d[, 1] - d[, 3]),
  "Always zero" = sd(d[, 1] - d[, 4]))
#>        ADL model (3) Intercept-only model          Always zero 
#>             4.043757             4.000221             3.995428


The comparison indicates that model (3) is not useful since it is outperformed
in terms of sample RMSFE by the intercept-only model. A model forecasting
excess returns always to be zero has an even lower sample RMSFE. This finding
is consistent with the weak-form efficiency hypothesis which states that all
publicly available information is accounted for in stock prices such that there is
no way to predict future stock prices or excess returns using past observations,
implying that the seemingly significant relationship indicated by model (3) is
spurious.


Summary 


This chapter dealt with introductory topics in time series regression analysis, 
where variables are generally correlated from one observation to the next, a 
concept termed serial correlation. We presented several ways of storing and 
plotting time series data using R and used these for informal analysis of economic 
data. 


We have introduced AR and ADL models and applied them in the context of 
forecasting of macroeconomic and financial time series using R. The discussion 
also included the topic of lag length selection. It was shown how to set up a 
simple function that computes the BIC for a model object supplied. 


We have also seen how to write simple R code for performing and evaluat- 
ing forecasts and demonstrated some more sophisticated approaches to conduct 
pseudo-out-of-sample forecasts for assessment of a model’s predictive power for 
unobserved future outcomes of a series, to check model stability and to compare 
different models. 


Furthermore, some more technical aspects like the concept of stationarity were 
addressed. This included applications to testing for an autoregressive unit root 
with the Dickey-Fuller test and the detection of a break in the population re- 
gression function using the QLR statistic. For both methods, the distribution 
of the relevant test statistic is non-normal, even in large samples. Concerning 
the Dickey-Fuller test we have used R’s random number generation facilities to 
produce evidence for this by means of a Monte-Carlo simulation and motivated 
usage of the quantiles tabulated in the book. 


Also, empirical studies regarding the validity of the weak and the strong form 
efficiency hypothesis which are presented in the applications Can You Beat the 
Market? Part I & II in the book have been reproduced using R. 


In all applications of this chapter, the focus was on forecasting future outcomes 
rather than estimation of causal relationships between time series variables. 




However, the methods needed for the latter are quite similar. Chapter 15 is 
devoted to estimation of so called dynamic causal effects. 


Chapter 15 


Estimation of Dynamic 
Causal Effects 


It sometimes is of interest to know the size of the current and future reactions
of $Y$ to a change in $X$. This is called the dynamic causal effect on $Y$ of a
change in $X$. This chapter discusses how to estimate dynamic causal effects in
R applications, where we investigate the dynamic effect of cold weather in
Florida on the price of orange juice concentrate.


The discussion covers:


• estimation of distributed lag models
• heteroskedasticity- and autocorrelation-consistent (HAC) standard errors
• generalized least squares (GLS) estimation of ADL models


To reproduce code examples, install the R packages listed below beforehand and 
make sure that the subsequent code chunk executes without any errors. 


• AER (Kleiber and Zeileis, 2020)
• dynlm (Zeileis, 2019)
• nlme (Pinheiro et al., 2020)
• orcutt (Spada, 2018)
• quantmod (Ryan and Ulrich, 2020)
• stargazer (Hlavac, 2018)


library(AER)
library(quantmod)
library(dynlm)
library(orcutt)
library(nlme)
library(stargazer)


15.1 The Orange Juice Data 


The largest cultivation region for oranges in the U.S. is located in Florida,
which usually has an ideal climate for growing the fruit. It thus is the source
of almost all frozen juice concentrate produced in the country. However, from
time to time and depending on their severity, cold snaps cause a loss of
harvests such that the supply of oranges decreases and consequently the price of
frozen juice concentrate rises. The timing of the price increases is
complicated: a cut in today's supply of concentrate influences not just today's
price but also future prices because supply in future periods will decrease,
too. Clearly, the magnitude of today's and future price increases due to a
freeze is an empirical question that can be investigated using a distributed lag
model, that is, a time series model that relates price changes to weather
conditions.


To begin with the analysis, we reproduce Figure 15.1 of the book which displays 
plots of the price index for frozen concentrated orange juice, percentage changes 
in the price as well as monthly freezing degree days in Orlando, the center of 
Florida’s orange-growing region. 


# load the frozen orange juice data set 
data("FrozenJuice") 


# compute the price index for frozen concentrated juice 
FOJCPI <- FrozenJuice[, "price"]/FrozenJuice[, "ppi"] 
FOJC_pctc <- 100 * diff (log(FOJCPI)) 

FDD <- FrozenJuice[, "fdd"] 


# convert series to xts objects
FOJCPI_xts <- as.xts(FOJCPI)
FDD_xts <- as.xts(FrozenJuice[, 3])

# plot orange juice price index
plot(as.zoo(FOJCPI),
     col = "steelblue",
     lwd = 2,
     xlab = "Date",
     ylab = "Price index",
     main = "Frozen Concentrated Orange Juice")


[Figure: Frozen Concentrated Orange Juice]


# divide plotting area
par(mfrow = c(2, 1))

# plot percentage changes in prices
plot(as.zoo(FOJC_pctc),
     col = "steelblue",
     lwd = 2,
     xlab = "Date",
     ylab = "Percent",
     main = "Monthly Changes in the Price of Frozen Concentrated Orange Juice")

# plot freezing degree days
plot(as.zoo(FDD),
     col = "steelblue",
     lwd = 2,
     xlab = "Date",
     ylab = "Freezing degree days",
     main = "Monthly Freezing Degree Days in Orlando, FL")


[Figure: Monthly Changes in the Price of Frozen Concentrated Orange Juice;
Monthly Freezing Degree Days in Orlando, FL]


Periods with a large number of freezing degree days are followed by large
month-to-month price changes. These coinciding movements motivate a simple
regression of price changes ($\%ChgOJC_t$) on freezing degree days ($FDD_t$) to
estimate the effect of an additional freezing degree day on the price in the
current month. For this, as for all other regressions in this chapter, we use
$T = 611$ observations (January 1950 to December 2000).


# simple regression of percentage changes on freezing degree days
orange_SR <- dynlm(FOJC_pctc ~ FDD)
coeftest(orange_SR, vcov. = vcovHAC)
#> 
#> t test of coefficients:
#> 
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) -0.42095    0.18683 -2.2531 0.0246064 *  
#> FDD          0.46724    0.13385  3.4906 0.0005167 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Notice that the standard errors are computed using a "HAC" estimator of the
variance-covariance matrix, see Chapter 14.5 for a discussion of this estimator.


$$
\widehat{\%ChgOJC_t} = \underset{(0.19)}{-0.42} + \underset{(0.13)}{0.47} FDD_t
$$


The estimated coefficient on $FDD_t$ has the following interpretation: an
additional freezing degree day in month $t$ leads to a price increase of 0.47
percentage points in the same month.


To consider effects of cold snaps on the orange juice price over the subsequent
periods, we include lagged values of $FDD_t$ in our model which leads to a
distributed lag regression model. We estimate a specification using a
contemporaneous and six lagged values of $FDD_t$ as regressors.


# distributed lag model with 6 lags of freezing degree days
orange_DLM <- dynlm(FOJC_pctc ~ FDD + L(FDD, 1:6))
coeftest(orange_DLM, vcov. = vcovHAC)
#> 
#> t test of coefficients:
#> 
#>               Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)  -0.692961   0.212445 -3.2618 0.0011700 ** 
#> FDD           0.471433   0.135195  3.4871 0.0005242 ***
#> L(FDD, 1:6)1  0.145021   0.081557  1.7782 0.0758853 .  
#> L(FDD, 1:6)2  0.058364   0.058911  0.9907 0.3222318    
#> L(FDD, 1:6)3  0.074166   0.047143  1.5732 0.1162007    
#> L(FDD, 1:6)4  0.036304   0.029335  1.2376 0.2163670    
#> L(FDD, 1:6)5  0.048756   0.031370  1.5543 0.1206535    
#> L(FDD, 1:6)6  0.050246   0.045129  1.1134 0.2659919    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


As a result we obtain


$$
\begin{aligned}
\widehat{\%ChgOJC_t} = & \underset{(0.21)}{-0.69}
  + \underset{(0.14)}{0.47} FDD_t
  + \underset{(0.08)}{0.15} FDD_{t-1}
  + \underset{(0.06)}{0.06} FDD_{t-2}
  + \underset{(0.05)}{0.07} FDD_{t-3} \\
  & + \underset{(0.03)}{0.04} FDD_{t-4}
  + \underset{(0.03)}{0.05} FDD_{t-5}
  + \underset{(0.05)}{0.05} FDD_{t-6},
\end{aligned}
\tag{15.1}
$$


where the coefficient on $FDD_{t-1}$ estimates the price increase in period $t$
caused by an additional freezing degree day in the preceding month, the
coefficient on $FDD_{t-2}$ estimates the effect of an additional freezing degree
day two months ago and so on. Consequently, the coefficients in (15.1) can be
interpreted as price changes in current and future periods due to a unit
increase in the current month's freezing degree days.


15.2 Dynamic Causal Effects 


This section of the book describes the general idea of a dynamic causal effect 
and how the concept of a randomized controlled experiment can be translated 
to time series applications, using several examples. 


In general, for empirical attempts to measure a dynamic causal effect, the as- 
sumptions of stationarity (see Key Concept 14.5) and exogeneity must hold. In 
time series applications up until here we have assumed that the model error 
term has conditional mean zero given current and past values of the regressors. 
For estimation of a dynamic causal effect using a distributed lag model, assum- 
ing a stronger form termed strict exogeneity may be useful. Strict exogeneity 
states that the error term has mean zero conditional on past, present and future 
values of the independent variables. 


The two concepts of exogeneity and the distributed lag model are summarized
in Key Concept 15.1.


Key Concept 15.1
The Distributed Lag Model and Exogeneity

The general distributed lag model is

$$
Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \beta_3 X_{t-2} + \dots
  + \beta_{r+1} X_{t-r} + u_t, \tag{15.2}
$$

where it is assumed that


1. $X$ is an exogenous variable,

   $$E(u_t \vert X_t, X_{t-1}, X_{t-2}, \dots) = 0.$$


2. (a) $X_t, Y_t$ have a stationary distribution.

   (b) $(Y_t, X_t)$ and $(Y_{t-j}, X_{t-j})$ become independently distributed
   as $j$ gets large.


3. Large outliers are unlikely. In particular, we need that all the variables
   have more than eight finite moments, a stronger assumption than before
   (four finite moments) that is required for computation of the HAC
   covariance matrix estimator.


4. There is no perfect multicollinearity.


The distributed lag model may be extended to include contemporaneous
and past values of additional regressors.


On the assumption of exogeneity


• There is another form of exogeneity termed strict exogeneity which
  assumes

  $$E(u_t \vert \dots, X_{t+2}, X_{t+1}, X_t, X_{t-1}, X_{t-2}, \dots) = 0,$$

  that is, the error term has mean zero conditional on past, present
  and future values of $X$. Strict exogeneity implies exogeneity (as
  defined in 1. above) but not the other way around. From this
  point on we will therefore distinguish between exogeneity and strict
  exogeneity.


• Exogeneity as in 1. suffices for the OLS estimators of the coefficients
  in distributed lag models to be consistent. However, if the
  assumption of strict exogeneity is valid, more efficient estimators
  can be applied.


15.3 Dynamic Multipliers and Cumulative Dynamic Multipliers


The following terminology regarding the coefficients in the distributed lag model
(15.2) is useful for upcoming applications:


• The dynamic causal effect is also called the dynamic multiplier.
  $\beta_{h+1}$ in (15.2) is the $h$-period dynamic multiplier.


• The contemporaneous effect of $X$ on $Y$, $\beta_1$, is termed the impact
  effect.


• The $h$-period cumulative dynamic multiplier of a unit change in $X$ on $Y$
  is defined as the cumulative sum of the dynamic multipliers. In particular,
  $\beta_1$ is the zero-period cumulative dynamic multiplier,
  $\beta_1 + \beta_2$ is the one-period cumulative dynamic multiplier and so
  forth.


The cumulative dynamic multipliers of the distributed lag model (15.2)
are the coefficients $\delta_1, \delta_2, \dots, \delta_r, \delta_{r+1}$ in the
modified regression

$$
Y_t = \delta_0 + \delta_1 \Delta X_t + \delta_2 \Delta X_{t-1} + \dots
  + \delta_r \Delta X_{t-r+1} + \delta_{r+1} X_{t-r} + u_t \tag{15.3}
$$

and thus can be directly estimated using OLS which makes it convenient to
compute their HAC standard errors. $\delta_{r+1}$ is called the long-run
cumulative dynamic multiplier.
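

To see why the coefficients of the modified regression are cumulative sums, it
may help to work through the case $r = 2$. The following derivation is a sketch
for this special case only; it rewrites each level of $X$ in terms of
differences and the oldest lag.

```latex
\begin{align*}
Y_t &= \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \beta_3 X_{t-2} + u_t \\
    &= \beta_0 + \beta_1 (\Delta X_t + \Delta X_{t-1} + X_{t-2})
       + \beta_2 (\Delta X_{t-1} + X_{t-2}) + \beta_3 X_{t-2} + u_t \\
    &= \beta_0 + \beta_1 \Delta X_t + (\beta_1 + \beta_2) \Delta X_{t-1}
       + (\beta_1 + \beta_2 + \beta_3) X_{t-2} + u_t
\end{align*}
```

Comparing terms with the modified regression for $r = 2$ gives
$\delta_0 = \beta_0$, $\delta_1 = \beta_1$, $\delta_2 = \beta_1 + \beta_2$ and
$\delta_3 = \beta_1 + \beta_2 + \beta_3$, so each $\delta$ coefficient is the
cumulative sum of the dynamic multipliers up to the respective horizon.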


It is straightforward to compute the cumulative dynamic multipliers for (15.1), 
the estimated distributed lag regression of changes in orange juice concen- 
trate prices on freezing degree days, using the corresponding model object 
orange_DLM and the function cumsum(). 


# compute cumulative multipliers
cum_mult <- cumsum(orange_DLM$coefficients[-1])

# rename entries
names(cum_mult) <- paste(0:6, sep = "-", "period CDM")

cum_mult
#> 0-period CDM 1-period CDM 2-period CDM 3-period CDM 4-period CDM 5-period CDM 
#>    0.4714329    0.6164542    0.6748177    0.7489835    0.7852874    0.8340436 
#> 6-period CDM 
#>    0.8842895


Translating the distributed lag model with six lags of FDD to (15.3), we see
that the OLS coefficient estimates in this model coincide with the multipliers
stored in cum_mult.


# estimate cumulative dynamic multipliers using the modified regression
cum_mult_reg <- dynlm(FOJC_pctc ~ d(FDD) + d(L(FDD, 1:5)) + L(FDD, 6))
coef(cum_mult_reg)[-1]
#>          d(FDD) d(L(FDD, 1:5))1 d(L(FDD, 1:5))2 d(L(FDD, 1:5))3 d(L(FDD, 1:5))4 
#>       0.4714329       0.6164542       0.6748177       0.7489835       0.7852874 
#> d(L(FDD, 1:5))5       L(FDD, 6) 
#>       0.8340436       0.8842895


As noted above, using a model specification as in (15.3) makes it easy to obtain
standard errors for the estimated cumulative dynamic multipliers.


# obtain coefficient summary that reports HAC standard errors
coeftest(cum_mult_reg, vcov. = vcovHAC)
#> 
#> t test of coefficients:
#> 
#>                 Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)     -0.69296    0.23668 -2.9278 0.0035431 ** 
#> d(FDD)           0.47143    0.13583  3.4709 0.0005562 ***
#> d(L(FDD, 1:5))1  0.61645    0.13145  4.6896 3.395e-06 ***
#> d(L(FDD, 1:5))2  0.67482    0.16009  4.2151 2.882e-05 ***
#> d(L(FDD, 1:5))3  0.74898    0.17263  4.3387 1.682e-05 ***
#> d(L(FDD, 1:5))4  0.78529    0.17351  4.5260 7.255e-06 ***
#> d(L(FDD, 1:5))5  0.83404    0.18236  4.5737 5.827e-06 ***
#> L(FDD, 6)        0.88429    0.19303  4.5810 5.634e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


15.4 HAC Standard Errors 


The error term $u_t$ in the distributed lag model (15.2) may be serially
correlated due to serially correlated determinants of $Y_t$ that are not
included as regressors. When these factors are not correlated with the
regressors included in the model, serially correlated errors do not violate the
assumption of exogeneity such that the OLS estimator remains unbiased and
consistent.


However, autocorrelated errors render the usual homoskedasticity-only
and heteroskedasticity-robust standard errors invalid and may cause misleading
inference. HAC standard errors are a remedy.


Key Concept 15.2
HAC Standard Errors


Problem:
If the error term $u_t$ in the distributed lag model (15.2) is serially
correlated, statistical inference that rests on usual (heteroskedasticity-robust)
standard errors can be strongly misleading.


Solution:
Heteroskedasticity- and autocorrelation-consistent (HAC) estimators
of the variance-covariance matrix circumvent this issue. There are
R functions like vcovHAC() from the package sandwich which are
convenient for computation of such estimators.


The package sandwich also contains the function NeweyWest(), an
implementation of the HAC variance-covariance estimator proposed by
Newey and West (1987).
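

The problem described in Key Concept 15.2 can be illustrated with a small Monte
Carlo experiment in base R. The following is a sketch under stated assumptions,
not part of the book's applications: both the regressor and the error follow
independent AR(1) processes with coefficient 0.8, and all names (n_obs, reps,
slope, se_ols) as well as the parameter values are hypothetical choices.

```r
# Illustrative Monte Carlo sketch (names and parameter values are assumptions):
# with serially correlated errors and a serially correlated regressor, the
# usual OLS standard error understates the sampling variability of the slope.
set.seed(1)

n_obs <- 200     # observations per simulated sample
reps  <- 500     # number of Monte Carlo replications

slope  <- numeric(reps)
se_ols <- numeric(reps)

for (r in 1:reps) {
  X <- as.numeric(arima.sim(n = n_obs, model = list(ar = 0.8)))
  u <- as.numeric(arima.sim(n = n_obs, model = list(ar = 0.8)))
  Y <- 0.5 * X + u          # X and u are independent, so X is exogenous

  cf <- coef(summary(lm(Y ~ X)))
  slope[r]  <- cf["X", "Estimate"]
  se_ols[r] <- cf["X", "Std. Error"]
}

# the OLS slope estimates are centered close to the true value 0.5 ...
mean(slope)

# ... but their standard deviation across replications exceeds the average
# reported homoskedasticity-only standard error
c(sd_of_estimates = sd(slope), mean_reported_se = mean(se_ols))
```

In this design the estimator itself behaves well, while t-statistics based on
the reported standard errors would reject a true null hypothesis too often.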





Consider the distributed lag regression model with no lags and a single
regressor $X_t$,

$$Y_t = \beta_0 + \beta_1 X_t + u_t,$$

with autocorrelated errors. A brief derivation of

$$
\tilde{\sigma}^2_{\hat{\beta}_1} = \hat{\sigma}^2_{\hat{\beta}_1} \hat{f}_T, \tag{15.4}
$$

the so-called Newey-West variance estimator for the variance of the OLS
estimator of $\beta_1$, is presented in Chapter 15.4 of the book.
$\hat{\sigma}^2_{\hat{\beta}_1}$ in (15.4) is the heteroskedasticity-robust
variance estimate of $\hat{\beta}_1$ and

$$
\hat{f}_T = 1 + 2 \sum_{j=1}^{m-1} \left( \frac{m-j}{m} \right) \tilde{\rho}_j \tag{15.5}
$$

is a correction factor that adjusts for serially correlated errors and involves
estimates of $m - 1$ autocorrelation coefficients $\tilde{\rho}_j$. As it turns
out, using the sample autocorrelation as implemented in acf() to estimate the
autocorrelation coefficients renders (15.4) inconsistent, see pp. 650-651 of the
book for a detailed argument. Therefore, we use a somewhat different estimator.
For a time series $X$ we have

$$
\tilde{\rho}_j = \frac{\sum_{t=j+1}^{T} \hat{v}_t \hat{v}_{t-j}}{\sum_{t=1}^{T} \hat{v}_t^2},
\quad \text{with} \quad \hat{v}_t = (X_t - \overline{X}) \hat{u}_t.
$$

We implement this estimator in the function acf_c() below.


$m$ in (15.5) is a truncation parameter to be chosen. A rule of thumb for
choosing $m$ is

$$
m = \left\lceil 0.75 \cdot T^{1/3} \right\rceil. \tag{15.6}
$$
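

As a worked example of the rule of thumb (15.6), the sketch below evaluates
$0.75 \cdot T^{1/3}$ for $T = 100$, the sample size of the simulation in this
section, and for $T = 611$, the number of observations in the orange juice
regressions. Since the result is generally not an integer it has to be rounded,
and the simulation code in this section rounds down using floor(). The names
T_obs and raw are hypothetical.

```r
# Worked arithmetic for the rule of thumb (15.6).
# 'T_obs' and 'raw' are hypothetical names used for illustration only.
T_obs <- c(100, 611)

# unrounded value of 0.75 * T^(1/3)
raw <- 0.75 * T_obs^(1/3)
raw

# rounding down and rounding up give different truncation parameters
rbind(T = T_obs, floor = floor(raw), ceiling = ceiling(raw))
```

For $T = 100$ the unrounded value is about 3.48, so the choice of rounding
decides between $m = 3$ and $m = 4$.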


We simulate a time series that, as stated above, follows a distributed lag model
with autocorrelated errors and then show how to compute the Newey-West HAC
estimate of $SE(\hat{\beta}_1)$ using R. This is done via two separate but, as
we will see, identical approaches: at first we follow the derivation presented
in the book step-by-step and compute the estimate "manually". We then show that
the result is exactly the estimate obtained when using the function
NeweyWest().


# function that computes rho tilde
# (Lag() is provided by the package quantmod)
acf_c <- function(x, j) {
  return(
    t(x[-c(1:j)]) %*% na.omit(Lag(x, j)) / t(x) %*% x
  )
}

# simulate time series with serially correlated errors
set.seed(1)

N <- 100

eps <- arima.sim(n = N, model = list(ma = 0.5))
X <- runif(N, 1, 10)
Y <- 0.5 * X + eps

# compute OLS residuals
res <- lm(Y ~ X)$res

# compute v
v <- (X - mean(X)) * res

# compute robust estimate of beta_1 variance
var_beta_hat <- 1/N * (1/(N-2) * sum((X - mean(X))^2 * res^2) ) /
                        (1/N * sum((X - mean(X))^2))^2

# rule of thumb truncation parameter (rounded down here)
m <- floor(0.75 * N^(1/3))

# compute correction factor
f_hat_T <- 1 + 2 * sum(
  (m - 1:(m-1))/m * sapply(1:(m - 1), function(i) acf_c(x = v, j = i))
)

# compute Newey-West HAC estimate of the standard error
sqrt(var_beta_hat * f_hat_T)
#> [1] 0.04036208


For the code to be reusable in other applications, we use sapply() to estimate
the $m - 1$ autocorrelations $\tilde{\rho}_j$.


# Using NeweyWest():
NW_VCOV <- NeweyWest(lm(Y ~ X),
                     lag = m - 1, prewhite = F,
                     adjust = T)

# compute standard error
sqrt(diag(NW_VCOV))[2]
#>          X 
#> 0.04036208


By choosing lag = m-1 we ensure that the maximum order of autocorrelations
used is $m - 1$, just as in equation (15.5). Notice that we set the arguments
prewhite = F and adjust = T to ensure that the formula (15.4) is used and
finite sample adjustments are made.


We find that the computed standard errors coincide. Of course, a variance- 
covariance matrix estimate as computed by NeweyWest() can be supplied as 
the argument vcov in coeftest() such that HAC t-statistics and p-values are 
provided by the latter. 


example_mod <- lm(Y ~ X)
coeftest(example_mod, vcov = NW_VCOV)
#> 
#> t test of coefficients:
#> 
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 0.542310   0.235423  2.3036  0.02336 *  
#> X           0.423305   0.040362 10.4877  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


15.5 Estimation of Dynamic Causal Effects with Strictly Exogenous Regressors


In general, the errors in a distributed lag model are correlated, which necessitates
the use of HAC standard errors for valid inference. If, however, the assumption
of exogeneity (the first assumption stated in Key Concept 15.1) is replaced by
strict exogeneity, that is

$$E(u_t \vert \dots, X_{t+1}, X_t, X_{t-1}, \dots) = 0,$$

more efficient approaches than OLS estimation of the coefficients are available.
For a general distributed lag model with $r$ lags and AR($p$) errors, these
approaches are summarized in Key Concept 15.4.


Key Concept 15.4 
Estimation of Dynamic Multipliers Under Strict Exogeneity 


Consider the general distributed lag model with $r$ lags and errors following
an AR($p$) process,

$$\begin{aligned}
Y_t &= \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \dots + \beta_{r+1} X_{t-r} + u_t, \\
u_t &= \phi_1 u_{t-1} + \phi_2 u_{t-2} + \dots + \phi_p u_{t-p} + \tilde{u}_t.
\end{aligned}$$

Under strict exogeneity of $X_t$, one may rewrite the above model in the
ADL specification

$$\begin{aligned}
Y_t = \alpha_0 &+ \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} \\
&+ \delta_0 X_t + \delta_1 X_{t-1} + \dots + \delta_q X_{t-q} + \tilde{u}_t,
\end{aligned}$$

where $q = r + p$, and compute estimates of the dynamic multipliers
$\beta_1, \beta_2, \dots, \beta_{r+1}$ using OLS estimates of
$\phi_1, \phi_2, \dots, \phi_p, \delta_0, \delta_1, \dots, \delta_q$.

An alternative is to estimate the dynamic multipliers using feasible GLS,
that is, to apply the OLS estimator to a quasi-difference specification of
(??). Under strict exogeneity, the feasible GLS approach is the BLUE
estimator for the dynamic multipliers in large samples.


On the one hand, as demonstrated in Chapter 15.5 of the book, OLS 
estimation of the ADL representation can be beneficial for estimation of 
the dynamic multipliers in large distributed lag models because it allows 
for a more parsimonious model that may be a good approximation to 
the large model. On the other hand, the GLS approach is more efficient 
than the ADL estimator if the sample size is large. 





We briefly review how the different representations of a small distributed lag
model can be obtained and show how this specification can be estimated by
OLS and GLS using R.


The model is

$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + u_t, \tag{15.9}$$

so a change in $X$ is modeled to affect $Y$ contemporaneously ($\beta_1$) and in the next
period ($\beta_2$). The error term $u_t$ is assumed to follow an AR(1) process,

$$u_t = \phi_1 u_{t-1} + \tilde{u}_t,$$

where $\tilde{u}_t$ is serially uncorrelated.

One can show that the ADL representation of this model is

$$Y_t = \alpha_0 + \phi_1 Y_{t-1} + \delta_0 X_t + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \tilde{u}_t, \tag{15.10}$$

with the restrictions

$$\begin{aligned}
\beta_1 &= \delta_0, \\
\beta_2 &= \delta_1 + \phi_1 \delta_0,
\end{aligned}$$

see p. 657 of the book.
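
The restrictions can be checked with a short derivation, sketched here under the AR(1) error assumption stated above. Lagging the distributed lag model by one period, multiplying it by $\phi_1$ and subtracting the result from the original equation gives

```latex
\begin{align*}
Y_t - \phi_1 Y_{t-1}
  &= \beta_0 (1 - \phi_1) + \beta_1 X_t + (\beta_2 - \phi_1 \beta_1) X_{t-1}
     - \phi_1 \beta_2 X_{t-2} + (u_t - \phi_1 u_{t-1}) \\
  &= \alpha_0 + \delta_0 X_t + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \tilde{u}_t ,
\end{align*}
```

so that $\alpha_0 = \beta_0(1 - \phi_1)$, $\delta_0 = \beta_1$, $\delta_1 = \beta_2 - \phi_1 \beta_1$ and $\delta_2 = -\phi_1 \beta_2$. Solving the expressions for $\delta_0$ and $\delta_1$ for the multipliers returns $\beta_1 = \delta_0$ and $\beta_2 = \delta_1 + \phi_1 \delta_0$.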


Quasi-Differences 


Another way of writing the ADL(1,2) representation (15.10) is the quasi-difference
model

$$\tilde{Y}_t = \alpha_0 + \beta_1 \tilde{X}_t + \beta_2 \tilde{X}_{t-1} + \tilde{u}_t, \tag{15.11}$$

where $\tilde{Y}_t = Y_t - \phi_1 Y_{t-1}$ and $\tilde{X}_t = X_t - \phi_1 X_{t-1}$. Notice that the error term $\tilde{u}_t$
is serially uncorrelated in both models and, as shown in Chapter 15.5 of the book,

$$E(u_t \vert X_{t+1}, X_t, X_{t-1}, \dots) = 0,$$

which is implied by the assumption of strict exogeneity.


We continue by simulating a time series of 500 observations using the model
(15.9) with $\beta_1 = 0.1$, $\beta_2 = 0.25$, $\phi_1 = 0.5$ and $\tilde{u}_t \sim N(0,1)$ and estimate the
different representations, starting with the distributed lag model (15.9).


# set seed for reproducibility 
set.seed(1) 


# simulate a time series with serially correlated errors 
obs <- 501 

eps <- arima.sim(n = obs-1 , model = list(ar = 0.5)) 

X <- arima.sim(n = obs, model = list(ar = 0.25)) 

Y <- 0.1 * X[-1] + 0.25 * X[-obs] + eps 

X <- ts(X[-1]) 


# estimate the distributed lag model 
dlm <- dynlm(Y ~ X + L(X)) 

Let us check that the residuals of this model exhibit autocorrelation using acf().


# check that the residuals are serially correlated 
acf(residuals(dlm))


[Figure: sample autocorrelation function of residuals(dlm); x-axis: Lag, y-axis: ACF]


In particular, the pattern reveals that the residuals follow an autoregressive 
process, as the sample autocorrelation function decays quickly for the first few 
lags and is probably zero for higher lag orders. In any case, HAC standard 
errors should be used. 


# coefficient summary using the Newey-West SE estimates 
coeftest(dlm, vcov = NeweyWest, prewhite = F, adjust = T) 


#>
#> t test of coefficients:
#>
#>             Estimate Std. Error t value  Pr(>|t|)
#> (Intercept) 0.038340   0.073411  0.5223  0.601717
#> X           0.123661   0.046710  2.6474  0.008368 **
#> L(X)        0.247406   0.046377  5.3347 1.458e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


OLS Estimation of the ADL Model 


Next, we estimate the ADL(1,2) model (15.10) using OLS. The errors are
uncorrelated in this representation of the model. This statement is supported by
a plot of the sample autocorrelation function of the residual series.

# estimate the ADL(1,2) representation of the distributed lag model
adl21_dynamic <- dynlm(Y ~ L(Y) + X + L(X, 1:2))

# plot the sample autocorrelations of residuals
acf(adl21_dynamic$residuals)


[Figure: sample autocorrelation function of adl21_dynamic$residuals; x-axis: Lag, y-axis: ACF]


The estimated coefficients in adl21_dynamic$coefficients are not the dynamic
multipliers we are interested in. Instead, the multipliers can be computed from
the restrictions stated with (15.10), where the true coefficients are replaced by
the OLS estimates.


# compute estimated dynamic effects using coefficient restrictions
# in the ADL(1,2) representation
t <- adl21_dynamic$coefficients

c("hat_beta_1" = t[3],
  "hat_beta_2" = t[4] + t[3] * t[2])
#>          hat_beta_1.X hat_beta_2.L(X, 1:2)1
#>             0.1176425             0.2478484


GLS Estimation 


Strict exogeneity allows for OLS estimation of the quasi-difference model 
(15.11). The idea of applying the OLS estimator to a model where the variables 
are linearly transformed, such that the model errors are uncorrelated and 
homoskedastic, is called generalized least squares (GLS). 


The OLS estimator in (15.11) is called the infeasible GLS estimator because $\tilde{Y}$
and $\tilde{X}$ cannot be computed without knowing $\phi_1$, the autoregressive coefficient
in the error AR(1) model, which is generally unknown in practice.


Assume we knew that $\phi_1 = 0.5$. We then may obtain the infeasible GLS estimates
of the dynamic multipliers in (15.9) by applying OLS to the transformed data.


# GLS: estimate quasi-differenced specification by OLS
iGLS_dynamic <- dynlm(I(Y - 0.5 * L(Y)) ~ I(X - 0.5 * L(X)) + I(L(X) - 0.5 * L(X, 2)))

summary(iGLS_dynamic)
#>
#> Time series regression with "ts" data:
#> Start = 3, End = 500
#>
#> Call:
#> dynlm(formula = I(Y - 0.5 * L(Y)) ~ I(X - 0.5 * L(X)) + I(L(X) -
#>     0.5 * L(X, 2)))
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -3.0325 -0.6375 -0.0499  0.6658  3.7724
#>
#> Coefficients:
#>                         Estimate Std. Error t value Pr(>|t|)
#> (Intercept)              0.01620    0.04564   0.355  0.72273
#> I(X - 0.5 * L(X))        0.12000    0.04237   2.832  0.00481 **
#> I(L(X) - 0.5 * L(X, 2))  0.25266    0.04237   5.963 4.72e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.017 on 495 degrees of freedom
#> Multiple R-squared:  0.07035,    Adjusted R-squared:  0.0666
#> F-statistic: 18.73 on 2 and 495 DF,  p-value: 1.442e-08


The feasible GLS estimator uses preliminary estimation of the coefficients in 
the presumed error term model, computes the quasi-differenced data and then 
estimates the model using OLS. This idea was introduced by Cochrane and 
Orcutt (1949) and can be extended by continuing this process iteratively. Such a 
procedure is implemented in the function cochrane.orcutt() from the package 
orcutt. 


X_t <- c(X[-1])
# create first lag
X_l1 <- c(X[-500])
Y_t <- c(Y[-1])

# iterated cochrane-orcutt procedure
summary(cochrane.orcutt(lm(Y_t ~ X_t + X_l1)))


#> Call:
#> lm(formula = Y_t ~ X_t + X_l1)
#>
#>             Estimate Std. Error t value  Pr(>|t|)
#> (Intercept) 0.032885   0.085163   0.386   0.69956
#> X_t         0.120128   0.042534   2.824   0.00493 **
#> X_l1        0.252406   0.042538   5.934 5.572e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.0165 on 495 degrees of freedom
#> Multiple R-squared:  0.0704 ,  Adjusted R-squared:  0.0666
#> F-statistic: 18.7 on 2 and 495 DF,  p-value: < 1.429e-08
#>
#> Durbin-Watson statistic
#> (original):    1.06907 , p-value: 1.05e-25
#> (transformed): 1.98192 , p-value: 4.246e-01
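
The steps behind the feasible GLS idea can also be sketched by hand in base R. The following is a minimal, non-iterated illustration on freshly simulated data and is not the algorithm implemented in cochrane.orcutt(); the variable names and the helper qd() are our own assumptions.

```r
# simulate a distributed lag model with AR(1) errors (illustrative values)
set.seed(1)
n    <- 500
x    <- as.numeric(arima.sim(n = n + 1, model = list(ar = 0.25)))
u    <- as.numeric(arima.sim(n = n, model = list(ar = 0.5)))
x_t  <- x[-1]        # X_t
x_l1 <- x[-(n + 1)]  # X_{t-1}
y    <- 0.1 * x_t + 0.25 * x_l1 + u

# step 1: OLS on the distributed lag model
ols <- lm(y ~ x_t + x_l1)
res <- residuals(ols)

# step 2: estimate phi_1 by regressing the residuals on their first lag
phi_hat <- coef(lm(res[-1] ~ 0 + res[-n]))[[1]]

# step 3: quasi-difference all variables and re-estimate by OLS
# qd() is a hypothetical helper computing z_t - phi_hat * z_{t-1}
qd   <- function(z) z[-1] - phi_hat * z[-n]
fgls <- lm(qd(y) ~ qd(x_t) + qd(x_l1))

phi_hat
coef(fgls)
```

The slope coefficients of fgls estimate the dynamic multipliers, while its intercept estimates $\beta_0(1 - \phi_1)$ rather than $\beta_0$. Repeating steps 2 and 3 with residuals based on the new coefficient estimates would give an iterated version of the procedure.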


Some more sophisticated methods for GLS estimation are provided with the
package nlme. The function gls() can be used to fit linear models by maximum
likelihood estimation algorithms and allows one to specify a correlation structure for
the error term.


# feasible GLS maximum likelihood estimation procedure
summary(gls(Y_t ~ X_t + X_l1, correlation = corAR1()))
#> Generalized least squares fit by REML
#>   Model: Y_t ~ X_t + X_l1
#>   Data: NULL
#>        AIC     BIC    logLik
#>   1451.847 1472.88 -720.9235
#>
#> Correlation Structure: AR(1)
#>  Formula: ~1
#>  Parameter estimate(s):
#>       Phi
#> 0.4668343
#>
#> Coefficients:
#>                  Value  Std.Error  t-value p-value
#> (Intercept) 0.03929124 0.08530544 0.460595  0.6453
#> X_t         0.11986994 0.04252270 2.818963  0.0050
#> X_l1        0.25287471 0.04252497 5.946500  0.0000
#>
#>  Correlation:


#>      (Intr) X_t
#> X_t  0.039
#> X_l1 0.037  0.238

#>
#> Standardized residuals:
#>         Min          Q1         Med          Q3         Max
#> -3.00075518 -0.64255522 -0.05400347  0.69101814  3.28555793
#>
#> Residual standard error: 1.14952
#> Degrees of freedom: 499 total; 496 residual


Notice that in this example, the coefficient estimates produced by GLS are 
somewhat closer to their true values and that the standard errors are the smallest 
for the GLS estimator. 


15.6 Orange Juice Prices and Cold Weather 


This section investigates the following two questions using the time series
regression methods discussed here:


• How persistent is the effect of a single freeze on orange juice concentrate
prices?


• Has the effect been stable over the whole time span?


We start by estimating dynamic causal effects with a distributed lag model
where $\%ChgOJC_t$ is regressed on $FDD_t$ and 18 lags. A second model
specification considers a transformation of the distributed lag model which allows
us to estimate the 19 cumulative dynamic multipliers using OLS. The third model
adds 11 binary variables (one for each of the months from February to December)
to adjust for a possible omitted variable bias arising from correlation of
$FDD_t$ and seasons by adding season(FDD) to the right hand side of the formula
of the second model.
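
The link between the two specifications can be verified numerically. The sketch below uses simulated data with only two lags and base R (all object names and parameter values are illustrative assumptions): regressing on the differences of all but the last lag, plus the level of the last lag, yields coefficients that equal the running sums of the dynamic multipliers.

```r
# simulate a distributed lag model with two lags (illustrative values)
set.seed(42)
n  <- 300
x  <- rnorm(n)

# lag matrix: columns are X_t, X_{t-1}, X_{t-2}
Xl <- embed(x, 3)
y  <- drop(Xl %*% c(0.5, 0.3, 0.1)) + rnorm(n - 2)

# distributed lag model: slope coefficients are the dynamic multipliers
dl <- lm(y ~ Xl)

# transformed model: regress on Delta X_t, Delta X_{t-1} and X_{t-2}
dX <- cbind(Xl[, 1] - Xl[, 2], Xl[, 2] - Xl[, 3], Xl[, 3])
cm <- lm(y ~ dX)

# cumulative multipliers: running sums vs. coefficients of the transformed model
cbind(cumsum_of_dl = cumsum(coef(dl)[-1]), transformed = coef(cm)[-1])
```

Both columns agree up to numerical precision because the two regressions span the same column space. The advantage of the transformed model is that standard errors for the cumulative multipliers are obtained directly.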


# estimate distributed lag models of frozen orange juice price changes 


FOJC_mod_DM <- dynlm(FOJC_pctc ~ L(FDD, 0:18)) 
FOJC_mod_CM1 <- dynlm(FOJC_pctc ~ L(d(FDD), 0:17) + L(FDD, 18)) 


FOJC_mod_CM2 <- dynlm(FOJC_pctc ~ L(d(FDD), 0:17) + L(FDD, 18) + season(FDD)) 


The above models include a large number of lags with default labels that
correspond to the degree of differencing and the lag orders which makes it somewhat
cumbersome to read the output. The regressor labels of a model object may
be altered by overriding the attribute names of the coefficient section using the
function attr(). Thus, for better readability we use the lag orders as regressor
labels.


# set lag orders as regressor labels
attr(FOJC_mod_DM$coefficients, "names")[1:20] <- c("(Intercept)", as.character(0:18))
attr(FOJC_mod_CM1$coefficients, "names")[1:20] <- c("(Intercept)", as.character(0:18))
attr(FOJC_mod_CM2$coefficients, "names")[1:20] <- c("(Intercept)", as.character(0:18))


Next, we compute HAC standard errors for each model using
NeweyWest() and gather the results in a list which is then supplied as the
argument se to the function stargazer(), see below. The sample consists of
612 observations:


length(FDD)
#> [1] 612


According to (15.6), the rule of thumb for choosing the HAC standard error
truncation parameter $m$, we choose

$$m = \left\lceil 0.75 \cdot 612^{1/3} \right\rceil = \lceil 6.37 \rceil = 7.$$

To check for sensitivity of the standard errors to different choices of the
truncation parameter in the model that is used to estimate the cumulative multipliers,
we also compute the Newey-West estimator for $m = 14$.
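
The rule-of-thumb arithmetic can be checked directly in R. This is a minimal sketch; the helper name hac_truncation is our own and not part of any package, and it rounds up as in the computation above.

```r
# hypothetical helper: rule-of-thumb HAC truncation parameter,
# m = ceiling(0.75 * T^(1/3))
hac_truncation <- function(n_obs) {
  ceiling(0.75 * n_obs^(1/3))
}

# sample size of the orange juice data set
m <- hac_truncation(612)
m
#> [1] 7
```

Note that the simulation example in the previous section rounds down using floor() instead, so the two conventions can differ by one.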


# gather HAC standard errors in a list
SEs <- list(sqrt(diag(NeweyWest(FOJC_mod_DM, lag = 7, prewhite = F))),
            sqrt(diag(NeweyWest(FOJC_mod_CM1, lag = 7, prewhite = F))),
            sqrt(diag(NeweyWest(FOJC_mod_CM1, lag = 14, prewhite = F))),
            sqrt(diag(NeweyWest(FOJC_mod_CM2, lag = 7, prewhite = F))))


The results are then used to reproduce the outcomes presented in Table 15.1 of 
the book. 


stargazer(FOJC_mod_DM , FOJC_mod_CM1, FOJC_mod_CM1, FOJC_mod_CM2,
  title = "Dynamic Effects of a Freezing Degree Day on the Price of Orange Juice",
  header = FALSE,
  digits = 3,
  column.labels = c("Dynamic Multipliers", rep("Dynamic Cumulative Multipliers", 3)),
  dep.var.caption = "Dependent Variable: Monthly Percentage Change in Orange Juice Price",
  dep.var.labels.include = FALSE,
  covariate.labels = as.character(0:18),
  omit = "season",
  se = SEs,
  no.space = T,
  add.lines = list(c("Monthly indicators?", "no", "no", "no", "yes"),
                   c("HAC truncation", "7", "7", "14", "7")),
  omit.stat = c("rsq", "f", "ser"))


According to column (1) of Table 15.1, the contemporaneous effect of a freezing
degree day is an increase of 0.5% in orange juice prices. The estimated effect
is only 0.17% for the next month and close to zero for subsequent months. In
fact, for all lags larger than 1, we cannot reject the null hypotheses that the
respective coefficients are zero at the 5% level using individual $t$-tests. The
model FOJC_mod_DM explains only little of the variation in the dependent
variable ($\bar{R}^2 = 0.11$).


Columns (2) and (3) present estimates of the dynamic cumulative multipliers of 
model FOJC_mod_CM1. Apparently, it does not matter whether we choose m = 7 
or m = 14 when computing HAC standard errors so we stick with m = 7 and 
the standard errors reported in column (2). 


If the demand for orange juice is higher in winter, $FDD_t$ would be correlated
with the error term since freezes occur mostly in winter, so we would face omitted
variable bias. The third model estimate, FOJC_mod_CM2, accounts for this
possible issue by using an additional set of 11 monthly dummies. For brevity,
estimates of the dummy coefficients are excluded from the output produced by
stargazer (this is achieved by setting omit = "season"). We may check that
the dummy for January was omitted to prevent perfect multicollinearity.


# estimates on monthly dummies
FOJC_mod_CM2$coefficients[-c(1:20)]
#> season(FDD)Feb season(FDD)Mar season(FDD)Apr season(FDD)May season(FDD)Jun
#>     -0.9565759     -0.6358007      0.5006770     -1.0801764      0.3195624
#> season(FDD)Jul season(FDD)Aug season(FDD)Sep season(FDD)Oct season(FDD)Nov
#>      0.1951113      0.3644312     -0.4130969     -0.1566622      0.3116534
#> season(FDD)Dec
#>      0.1481589


A comparison of the estimates presented in columns (3) and (4) indicates that
adding monthly dummies has a negligible effect. Further evidence for this comes
from a joint test of the hypothesis that the 11 dummy coefficients are zero.
Instead of using linearHypothesis(), we use the function waldtest() and
supply two model objects: unres_model, the unrestricted model object
which is the same as FOJC_mod_CM2 (except for the coefficient names since we
have modified them above) and res_model, the model where the restriction that
all dummy coefficients are zero is imposed. res_model is conveniently obtained
using the function update(). It extracts the argument formula of a model
object, updates it as specified and then re-fits the model. By setting formula
= . ~ . - season(FDD) we impose that the monthly dummies do not enter
the model.


Table 15.1: Dynamic Effects of a Freezing Degree Day on the Price of Orange Juice

Dependent Variable: Monthly Percentage Change in Orange Juice Price

                     Dyn. Mult.         Dyn. Cum. Mult.    Dyn. Cum. Mult.    Dyn. Cum. Mult.
                     (1)                (2)                (3)                (4)

lag 0                 0.508*** (0.137)  0.508*** (0.137)   0.508*** (0.139)   0.524*** (0.142)
lag 1                 0.172** (0.088)   0.680*** (0.134)   0.680*** (0.130)   0.720*** (0.142)
lag 2                 0.068 (0.060)     0.748*** (0.165)   0.748*** (0.162)   0.781*** (0.173)
lag 3                 0.070 (0.044)     0.819*** (0.181)   0.819*** (0.181)   0.861*** (0.190)
lag 4                 0.022 (0.031)     0.841*** (0.183)   0.841*** (0.184)   0.892*** (0.194)
lag 5                 0.027 (0.030)     0.868*** (0.189)   0.868*** (0.189)   0.904*** (0.199)
lag 6                 0.031 (0.047)     0.900*** (0.202)   0.900*** (0.208)   0.922*** (0.210)
lag 7                 0.015 (0.015)     0.915*** (0.205)   0.915*** (0.210)   0.939*** (0.212)
lag 8                -0.042 (0.034)     0.873*** (0.214)   0.873*** (0.218)   0.904*** (0.219)
lag 9                -0.010 (0.051)     0.862*** (0.236)   0.862*** (0.245)   0.884*** (0.239)
lag 10               -0.116* (0.069)    0.746*** (0.257)   0.746*** (0.262)   0.752*** (0.259)
lag 11               -0.067 (0.052)     0.680** (0.266)    0.680** (0.272)    0.677** (0.267)
lag 12               -0.143* (0.076)    0.537** (0.268)    0.537** (0.271)    0.551** (0.272)
lag 13               -0.083* (0.043)    0.454* (0.267)     0.454* (0.273)     0.491* (0.275)
lag 14               -0.057 (0.035)     0.397 (0.273)      0.397 (0.284)      0.427 (0.278)
lag 15               -0.032 (0.028)     0.366 (0.276)      0.366 (0.287)      0.406 (0.280)
lag 16               -0.005 (0.055)     0.360 (0.283)      0.360 (0.293)      0.408 (0.286)
lag 17                0.003 (0.018)     0.363 (0.287)      0.363 (0.294)      0.395 (0.290)
lag 18                0.003 (0.017)     0.366 (0.293)      0.366 (0.301)      0.386 (0.295)
Constant             -0.343 (0.269)    -0.343 (0.269)     -0.343 (0.256)     -0.241 (0.934)
Monthly indicators?   no                no                 no                 yes
HAC truncation        7                 7                  14                 7
Observations          594               594                594                594
Adjusted R²           0.109             0.109              0.109              0.101

Note: *p<0.1; **p<0.05; ***p<0.01


# test if coefficients on monthly dummies are zero
unres_model <- dynlm(FOJC_pctc ~ L(d(FDD), 0:17) + L(FDD, 18) + season(FDD))

res_model <- update(unres_model, formula = . ~ . - season(FDD))

waldtest(unres_model,
         res_model,
         vcov = NeweyWest(unres_model, lag = 7, prewhite = F))
#> Wald test
#>
#> Model 1: FOJC_pctc ~ L(d(FDD), 0:17) + L(FDD, 18) + season(FDD)
#> Model 2: FOJC_pctc ~ L(d(FDD), 0:17) + L(FDD, 18)
#>   Res.Df  Df      F Pr(>F)
#> 1    563
#> 2    574 -11 0.9683 0.4743


The p-value is 0.47 so we cannot reject the hypothesis that the coefficients on 
the monthly dummies are zero, even at the 10% level. We conclude that the 
seasonal fluctuations in demand for orange juice do not pose a serious threat to 
internal validity of the model. 


It is convenient to use plots of dynamic multipliers and cumulative dynamic 
multipliers. The following two code chunks reproduce Figures 15.2 (a) and 
15.2 (b) of the book which display point estimates of dynamic and cumulative 
multipliers along with upper and lower bounds of their 95% confidence intervals 
computed using the above HAC standard errors. 


# 95% CI bounds
point_estimates <- FOJC_mod_DM$coefficients

CI_bounds <- cbind("lower" = point_estimates - 1.96 * SEs[[1]],
                   "upper" = point_estimates + 1.96 * SEs[[1]])[-1, ]

# plot the estimated dynamic multipliers
plot(0:18, point_estimates[-1],
     type = "l",
     lwd = 2,
     col = "steelblue",
     ylim = c(-0.4, 1),
     xlab = "Lag",
     ylab = "Dynamic multiplier",
     main = "Dynamic Effect of FDD on Orange Juice Price")

# add a dashed line at 0
abline(h = 0, lty = 2)

# add CI bounds
lines(0:18, CI_bounds[, 1], col = "darkred")
lines(0:18, CI_bounds[, 2], col = "darkred")


Figure 15.1: Dynamic Multipliers


The 95% confidence intervals plotted in Figure 15.1 indeed include zero for lags 
larger than 1 such that the null of a zero multiplier cannot be rejected for these 
lags. 


# 95% CI bounds
point_estimates <- FOJC_mod_CM1$coefficients

CI_bounds <- cbind("lower" = point_estimates - 1.96 * SEs[[2]],
                   "upper" = point_estimates + 1.96 * SEs[[2]])[-1, ]

# plot estimated cumulative dynamic multipliers
plot(0:18, point_estimates[-1],
     type = "l",
     lwd = 2,
     col = "steelblue",
     ylim = c(-0.4, 1.6),
     xlab = "Lag",
     ylab = "Cumulative dynamic multiplier",
     main = "Cumulative Dynamic Effect of FDD on Orange Juice Price")

# add dashed line at 0
abline(h = 0, lty = 2)

# add CI bounds
lines(0:18, CI_bounds[, 1], col = "darkred")
lines(0:18, CI_bounds[, 2], col = "darkred")


Figure 15.2: Dynamic Cumulative Multipliers


As can be seen from Figure 15.2, the estimated dynamic cumulative multipliers
grow until the seventh month up to a price increase of about 0.91% and then
decline to the estimated long-run cumulative multiplier of 0.37% which,
however, is not significantly different from zero at the 5% level.


Have the dynamic multipliers been stable over time? One way to see this is 
to estimate these multipliers for different subperiods of the sample span. For 
example, consider periods 1950 - 1966, 1967 - 1983 and 1984 - 2000. If the 
multipliers are the same for all three periods the estimates should be close and 
thus the estimated cumulative multipliers should be similar, too. We investigate 
this by re-estimating FOJC_mod_CM1 for the three different time spans and then 
plot the estimated cumulative dynamic multipliers for the comparison. 


# estimate cumulative multipliers using different sample periods
FOJC_mod_CM1950 <- update(FOJC_mod_CM1, start = c(1950, 1), end = c(1966, 12))

FOJC_mod_CM1967 <- update(FOJC_mod_CM1, start = c(1967, 1), end = c(1983, 12))

FOJC_mod_CM1984 <- update(FOJC_mod_CM1, start = c(1984, 1), end = c(2000, 12))

# plot estimated dynamic cumulative multipliers (1950-1966)
plot(0:18, FOJC_mod_CM1950$coefficients[-1],
     type = "l",
     lwd = 2,
     col = "steelblue",
     xlim = c(0, 20),
     ylim = c(-0.5, 2),
     xlab = "Lag",
     ylab = "Cumulative dynamic multiplier",
     main = "Cumulative Dynamic Effect for Different Sample Periods")

# plot estimated dynamic cumulative multipliers (1967-1983)
lines(0:18, FOJC_mod_CM1967$coefficients[-1], lwd = 2)

# plot estimated dynamic cumulative multipliers (1984-2000)
lines(0:18, FOJC_mod_CM1984$coefficients[-1], lwd = 2, col = "darkgreen")

# add dashed line at 0
abline(h = 0, lty = 2)

# add annotations
text(18, -0.24, "1984 - 2000")
text(18, 0.6, "1967 - 1983")
text(18, 1.2, "1950 - 1966")


[Figure: Cumulative Dynamic Effect for Different Sample Periods; x-axis: Lag, y-axis: Cumulative dynamic multiplier; lines labeled 1950 - 1966, 1967 - 1983 and 1984 - 2000]


Clearly, the cumulative dynamic multipliers have changed considerably over 
time. The effect of a freeze was stronger and more persistent in the 1950s and 
1960s. For the 1970s the magnitude of the effect was lower but still highly persis- 
tent. We observe an even lower magnitude for the final third of the sample span 
(1984 - 2000) where the long-run effect is much less persistent and essentially 
zero after a year. 

A QLR test for a break in the distributed lag regression of column (1) in Table
15.1 with 15% trimming using a HAC variance-covariance
matrix estimate supports the conjecture that the population regression
coefficients have changed over time.


# set up a range of possible break dates
tau <- c(window(time(FDD),
                time(FDD)[round(612/100*15)],
                time(FDD)[round(612/100*85)]))

# initialize the vector of F-statistics
Fstats <- numeric(length(tau))

# the restricted model
res_model <- dynlm(FOJC_pctc ~ L(FDD, 0:18))

# estimation, loop over break dates
for(i in 1:length(tau)) {

  # set up dummy variable
  D <- time(FOJC_pctc) > tau[i]

  # estimate DL model with interactions
  unres_model <- dynlm(FOJC_pctc ~ D * L(FDD, 0:18))

  # compute and save F-statistic
  Fstats[i] <- waldtest(res_model,
                        unres_model,
                        vcov = NeweyWest(unres_model, lag = 7, prewhite = F))$F[2]

}


Note that this code takes a couple of seconds to run since a total of length(tau)
regressions with 40 model coefficients each are estimated.


# QLR test statistic 
max(Fstats) 
#> [1] 36.76819 


The QLR statistic is 36.77. From Table 14.5 of the book we see that the 1% 
critical value for the QLR test with 15% trimming and q = 20 restrictions is 
2.43. Since this is a right-sided test, the QLR statistic clearly lies in the region 
of rejection so we can discard the null hypothesis of no break in the population 
regression function. 


See Chapter 15.7 of the book for a discussion of empirical examples where it
is questionable whether the assumption of (past and present) exogeneity of the
regressors is plausible.


Summary 


• We have seen how R can be used to estimate the time path of the effect
on Y of a change in X (the dynamic causal effect on Y of a change in
X) using time series data on both. The corresponding model is called the
distributed lag model. Distributed lag models are conveniently estimated
using the function dynlm() from the package dynlm.


• The regression error in distributed lag models is often serially correlated
such that standard errors which are robust to heteroskedasticity and
autocorrelation should be used to obtain valid inference. The package
sandwich provides functions for computation of so-called HAC covariance
matrix estimators, for example vcovHAC() and NeweyWest().


• When X is strictly exogenous, more efficient estimates can be obtained
using an ADL model or by GLS estimation. Feasible GLS algorithms can
be found in the R packages orcutt and nlme. Chapter 15.7 of the book
emphasizes that the assumption of strict exogeneity is often implausible
in empirical applications.


Chapter 16 


Additional Topics in Time 
Series Regression 


This chapter discusses the following advanced topics in time series regression
and demonstrates how core techniques can be applied using R:


• Vector autoregressions (VARs). We focus on using VARs for forecasting.
Another branch of the literature is concerned with so-called Structural
VARs which are, however, beyond the scope of this chapter.


• Multiperiod forecasts. This includes a discussion of iterated and direct
(multivariate) forecasts.


• The DF-GLS test, a modification of the ADF test that has more power
than the latter when the series has deterministic components and is close
to being nonstationary.


• Cointegration analysis with an application to short- and long-term interest
rates. We demonstrate how to estimate a vector error correction model.


• Autoregressive conditional heteroskedasticity (ARCH) models. We show
how a simple generalized ARCH (GARCH) model can be helpful in
quantifying the risk associated with investing in the stock market in terms of
estimation and forecasting of the volatility of asset returns.


To reproduce the code examples, install the R packages listed below and make
sure that the subsequent code chunk executes without any errors.


• AER (Kleiber and Zeileis, 2020)
• dynlm (Zeileis, 2019)
• fGarch (Wuertz et al., 2020)
• quantmod (Ryan and Ulrich, 2020)
• readxl (Wickham and Bryan, 2019)
• scales (Wickham and Seidel, 2020)
• vars (Pfaff, 2018)


library(AER)
library(readxl)
library(dynlm)
library(vars)
library(quantmod)
library(scales)
library(fGarch)


16.1 Vector Autoregressions 


A vector autoregressive (VAR) model is useful when one is interested in
predicting multiple time series variables using a single model. At its core, the VAR
model is an extension of the univariate autoregressive model we have dealt with
in Chapters 14 and 15. Key Concept 16.1 summarizes the essentials of VAR.


Key Concept 16.1 
Vector Autoregressions 


The vector autoregression (VAR) model extends the idea of univariate
autoregression to $k$ time series regressions, where the lagged values of all
$k$ series appear as regressors. Put differently, in a VAR model we regress
a vector of time series variables on lagged vectors of these variables. As
for AR($p$) models, the lag order is denoted by $p$ so the VAR($p$) model of
two variables $X_t$ and $Y_t$ ($k = 2$) is given by the equations

$$\begin{aligned}
Y_t &= \beta_{10} + \beta_{11} Y_{t-1} + \dots + \beta_{1p} Y_{t-p} + \gamma_{11} X_{t-1} + \dots + \gamma_{1p} X_{t-p} + u_{1t}, \\
X_t &= \beta_{20} + \beta_{21} Y_{t-1} + \dots + \beta_{2p} Y_{t-p} + \gamma_{21} X_{t-1} + \dots + \gamma_{2p} X_{t-p} + u_{2t}.
\end{aligned}$$

The $\beta$s and $\gamma$s can be estimated using OLS on each equation. The
assumptions for VARs are the time series assumptions presented in Key
Concept 14.6 applied to each of the equations.


It is straightforward to estimate VAR models in R. A feasible approach
is to simply use lm() for estimation of the individual equations.
Furthermore, the R package vars provides standard tools for estimation,
diagnostic testing and prediction using this type of model.





When the assumptions of Key Concept 16.1 hold, the OLS estimators of the
VAR coefficients are consistent and jointly normal in large samples so that the
usual inferential methods such as confidence intervals and t-statistics can be
used.
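
As a rough illustration of equation-by-equation OLS, the following sketch simulates a bivariate VAR(1) and recovers its coefficient matrix with lm(). The coefficient matrix A, the sample size and all object names are assumptions chosen for illustration only.

```r
# simulate a stationary bivariate VAR(1): Z_t = A Z_{t-1} + u_t (illustrative values)
set.seed(123)
n <- 400
A <- matrix(c(0.5, 0.1,
              0.2, 0.3), nrow = 2, byrow = TRUE)
Z <- matrix(0, nrow = n, ncol = 2, dimnames = list(NULL, c("Y", "X")))
for (i in 2:n) Z[i, ] <- A %*% Z[i - 1, ] + rnorm(2)

# current values and first lags of both series
cur    <- Z[-1, ]
lagged <- Z[-n, ]

# estimate each equation separately by OLS
eq_Y <- lm(cur[, "Y"] ~ lagged[, "Y"] + lagged[, "X"])
eq_X <- lm(cur[, "X"] ~ lagged[, "Y"] + lagged[, "X"])

# stack the slope estimates; the rows should be close to the rows of A
A_hat <- rbind(coef(eq_Y)[-1], coef(eq_X)[-1])
round(A_hat, 2)
```

Each row of A_hat holds the estimated coefficients on the lagged Y and the lagged X in one equation, which mirrors the roles of the $\beta$s and $\gamma$s in Key Concept 16.1.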


The structure of VARs also allows us to jointly test restrictions across multiple
equations. For instance, it may be of interest to test whether the coefficients on
all regressors of the lag $p$ are zero. This corresponds to testing the null that the
lag order $p-1$ is correct. Large sample joint normality of the coefficient estimates
is convenient because it implies that we may simply use an $F$-test for this testing
problem. The explicit formula for such a test statistic is rather complicated
but fortunately such computations are easily done using the R functions we
work with in this chapter. Another way to determine optimal lag lengths is to use
information criteria like the $BIC$ which we have introduced for univariate time
series regressions in Chapter 14.6. Just as in the case of a single equation, for
a multiple equation model we choose the specification which has the smallest
$BIC(p)$, where

$$BIC(p) = \log\left[\det(\widehat{\Sigma}_u)\right] + k(kp+1) \frac{\log(T)}{T},$$

with $\widehat{\Sigma}_u$ denoting the estimate of the $k \times k$ covariance matrix of the VAR errors
and $\det(\cdot)$ denoting the determinant.
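
To make the formula concrete, the following sketch evaluates $BIC(p)$ for a given residual covariance matrix. The function name var_bic and the example inputs are our own illustrative assumptions; in applications the covariance estimate would come from the fitted VAR and would change with $p$.

```r
# hypothetical helper: BIC for a VAR(p) with k equations and T_obs observations
var_bic <- function(Sigma_u, p, T_obs) {
  k      <- nrow(Sigma_u)        # number of variables in the VAR
  n_coef <- k * (k * p + 1)      # total number of coefficients
  log(det(Sigma_u)) + n_coef * log(T_obs) / T_obs
}

# made-up residual covariance matrix for k = 2
Sigma_example <- matrix(c(1.0, 0.2,
                          0.2, 0.5), nrow = 2)

# penalty effect of the lag order, holding the covariance estimate fixed
bic_values <- sapply(1:4, function(p) var_bic(Sigma_example, p = p, T_obs = 128))
bic_values
```

With the covariance matrix held fixed, the criterion increases in $p$ through the penalty term alone, so a larger lag order is only preferred if it lowers $\log[\det(\widehat{\Sigma}_u)]$ by enough. The term $k(kp+1)$ also shows that the number of coefficients grows with the square of $k$.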


As for univariate distributed lag models, one should think carefully about
variables to include in a VAR, as adding unrelated variables reduces the forecast
accuracy by increasing the estimation error. This is particularly important
because the number of parameters to be estimated grows quadratically in the
number of variables modeled by the VAR. In the application below we shall see
that economic theory and empirical evidence are helpful for this decision.


A VAR Model of the Growth Rate of GDP and the Term Spread 


We now show how to estimate a VAR model of the GDP growth rate, GDPGR, 
and the term spread, TS'pread. As following the discussion on nonstationarity 
of GDP growth in Chapter 14.7 (recall the possible break in the early 1980s 
detected by the QLR test statistic), we use data from 1981:Q1 to 2012:Q4. The 
two model equations are 


GDPGR;, = Bio a Bu GDPGRi1 T Bi2GDPGRi-2 T yııTSpread—ı oe y12T Spread,_2 q 
T Spread, = B20 a B2,GDPGRi_1 TE B22GDPGRi_2 Tr yo1T Spread,_1 or y22T Spreadt—2 q 














The data set us_macro_quarterly.xlsx is provided on the companion website
to Stock and Watson (2015) and can be downloaded here. It contains quarterly
data on U.S. real (i.e., inflation adjusted) GDP from 1957 to 2013. We begin
by importing the data set and doing some formatting (we already worked with
this data set in Chapter 14 so you may skip these steps if you have already
loaded the data in your working environment).


# packages used in this chunk
library(readxl)  # read_xlsx()
library(zoo)     # as.yearqtr()

# load the U.S. macroeconomic data set
USMacroSWQ <- read_xlsx("Data/us_macro_quarterly.xlsx",
                        sheet = 1,
                        col_types = c("text", rep("numeric", 9)))

# set the column names
colnames(USMacroSWQ) <- c("Date", "GDPC96", "JAPAN_IP", "PCECTPI", "GS10",
                          "GS1", "TB3MS", "UNRATE", "EXUSUK", "CPIAUCSL")

# format the date column
USMacroSWQ$Date <- as.yearqtr(USMacroSWQ$Date, format = "%Y:0%q")

# define GDP as ts object
GDP <- ts(USMacroSWQ$GDPC96,
          start = c(1957, 1),
          end = c(2013, 4),
          frequency = 4)

# define GDP growth as a ts object
GDPGrowth <- ts(400*log(GDP[-1]/GDP[-length(GDP)]),
                start = c(1957, 2),
                end = c(2013, 4),
                frequency = 4)

# 3-months Treasury bill interest rate as a 'ts' object
TB3MS <- ts(USMacroSWQ$TB3MS,
            start = c(1957, 1),
            end = c(2013, 4),
            frequency = 4)

# 10-years Treasury bonds interest rate as a 'ts' object
TB10YS <- ts(USMacroSWQ$GS10,
             start = c(1957, 1),
             end = c(2013, 4),
             frequency = 4)

# generate the term spread series
TSpread <- TB10YS - TB3MS


We estimate both equations separately by OLS and use coeftest() to obtain 
robust standard errors. 


# Estimate both equations using 'dynlm()'
VAR_EQ1 <- dynlm(GDPGrowth ~ L(GDPGrowth, 1:2) + L(TSpread, 1:2),
                 start = c(1981, 1),
                 end = c(2012, 4))

VAR_EQ2 <- dynlm(TSpread ~ L(GDPGrowth, 1:2) + L(TSpread, 1:2),
                 start = c(1981, 1),
                 end = c(2012, 4))

# rename regressors for better readability
names(VAR_EQ1$coefficients) <- c("Intercept", "Growth_t-1",
                                 "Growth_t-2", "TSpread_t-1", "TSpread_t-2")
names(VAR_EQ2$coefficients) <- names(VAR_EQ1$coefficients)

# robust coefficient summaries
coeftest(VAR_EQ1, vcov. = sandwich)


#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> Intercept    0.516344   0.524429  0.9846 0.3267616    
#> Growth_t-1   0.289553   0.110827  2.6127 0.0101038 *  
#> Growth_t-2   0.216392   0.085879  2.5197 0.0130255 *  
#> TSpread_t-1 -0.902549   0.358290 -2.5190 0.0130498 *  
#> TSpread_t-2  1.329831   0.392660  3.3867 0.0009503 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(VAR_EQ2, vcov. = sandwich)
#> 
#> t test of coefficients:
#> 
#>               Estimate Std. Error t value  Pr(>|t|)    
#> Intercept    0.4557740  0.1214227  3.7536 0.0002674 ***
#> Growth_t-1   0.0099785  0.0218424  0.4568 0.6485920    
#> Growth_t-2  -0.0572451  0.0264099 -2.1676 0.0321186 *  
#> TSpread_t-1  1.0582279  0.0983750 10.7571 < 2.2e-16 ***
#> TSpread_t-2 -0.2191902  0.1086198 -2.0180 0.0457712 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


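We end up with the following results (robust standard errors in parentheses):


$$\begin{aligned}
\widehat{GDPGR}_t =& \, \underset{(0.52)}{0.52} + \underset{(0.11)}{0.29} GDPGR_{t-1} + \underset{(0.09)}{0.22} GDPGR_{t-2} - \underset{(0.36)}{0.90} TSpread_{t-1} + \underset{(0.39)}{1.33} TSpread_{t-2} \\
\widehat{TSpread}_t =& \, \underset{(0.12)}{0.46} + \underset{(0.02)}{0.01} GDPGR_{t-1} - \underset{(0.03)}{0.06} GDPGR_{t-2} + \underset{(0.10)}{1.06} TSpread_{t-1} - \underset{(0.11)}{0.22} TSpread_{t-2}
\end{aligned}$$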


The function VAR() from the package vars can be used to obtain the same
coefficient estimates as presented above since it applies OLS per equation, too.


# set up data for estimation using `VAR()`
VAR_data <- window(ts.union(GDPGrowth, TSpread), start = c(1980, 3), end = c(2012, 4))

# estimate model coefficients using `VAR()`
VAR_est <- VAR(y = VAR_data, p = 2)
VAR_est

#> 
#> VAR Estimation Results:
#> ======================= 
#> 
#> Estimated coefficients for equation GDPGrowth: 
#> ============================================== 
#> Call:
#> GDPGrowth = GDPGrowth.l1 + TSpread.l1 + GDPGrowth.l2 + TSpread.l2 + const 
#> 
#> GDPGrowth.l1   TSpread.l1 GDPGrowth.l2   TSpread.l2        const 
#>    0.2895533   -0.9025493    0.2163919    1.3298305    0.5163440 
#> 
#> 
#> Estimated coefficients for equation TSpread: 
#> ============================================ 
#> Call:
#> TSpread = GDPGrowth.l1 + TSpread.l1 + GDPGrowth.l2 + TSpread.l2 + const 
#> 
#> GDPGrowth.l1   TSpread.l1 GDPGrowth.l2   TSpread.l2        const 
#>  0.009978489  1.058227945 -0.057245123 -0.219190243  0.455773969


VAR() returns a list of lm objects which can be passed to the usual functions,
for example summary(), and so it is straightforward to obtain model statistics
for the individual equations.


# obtain the adj. R^2 from the output of 'VAR()'
summary(VAR_est$varresult$GDPGrowth)$adj.r.squared
#> [1] 0.2887223

summary(VAR_est$varresult$TSpread)$adj.r.squared
#> [1] 0.8254311


We may use the individual model objects to conduct Granger causality tests. 


# Granger causality tests:

# test if term spread has no power in explaining GDP growth
linearHypothesis(VAR_EQ1,
                 hypothesis.matrix = c("TSpread_t-1", "TSpread_t-2"),
                 vcov. = sandwich)

#> Linear hypothesis test
#> 
#> Hypothesis:
#> TSpread_t - 1 = 0
#> TSpread_t - 2 = 0
#> 
#> Model 1: restricted model
#> Model 2: GDPGrowth ~ L(GDPGrowth, 1:2) + L(TSpread, 1:2)
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F   Pr(>F)   
#> 1    125                      
#> 2    123  2 5.9094 0.003544 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# test if GDP growth has no power in explaining term spread 
linearHypothesis(VAR_EQ2, 
hypothesis.matrix = c("Growth_t-1", "Growth_t-2"), 
vcov. = sandwich) 
#> Linear hypothesis test
#> 
#> Hypothesis:
#> Growth_t - 1 = 0
#> Growth_t - 2 = 0
#> 
#> Model 1: restricted model
#> Model 2: TSpread ~ L(GDPGrowth, 1:2) + L(TSpread, 1:2)
#> 
#> Note: Coefficient covariance matrix supplied.
#> 
#>   Res.Df Df      F  Pr(>F)  
#> 1    125                    
#> 2    123  2 3.4777 0.03395 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Both Granger causality tests reject at the level of 5%. This is evidence in favor 
of the conjecture that the term spread has power in explaining GDP growth 
and vice versa. 
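

The logic of a Granger causality test can also be sketched without any add-on
packages. The chunk below is only an illustration on simulated data (the
series `x` and `y` and all coefficients are assumptions): it compares a
regression that includes lags of `x` with a restricted regression that omits
them. Note that anova() computes the homoskedasticity-only F-statistic, in
contrast to the robust tests computed with linearHypothesis() above.

```r
# Sketch: Granger causality as an F-test on the lags of the other variable.
# Simulated data; x helps to predict y by construction.
set.seed(42)
n <- 300
x <- as.numeric(arima.sim(list(ar = 0.5), n = n))
y <- numeric(n)
for (t in 3:n) y[t] <- 0.3 * y[t - 1] + 0.4 * x[t - 1] + rnorm(1)

# collect the dependent variable and two lags of each series
d <- data.frame(y  = y[3:n],
                y1 = y[2:(n - 1)], y2 = y[1:(n - 2)],
                x1 = x[2:(n - 1)], x2 = x[1:(n - 2)])

unrestricted <- lm(y ~ y1 + y2 + x1 + x2, data = d)
restricted   <- lm(y ~ y1 + y2, data = d)   # imposes zero coefficients on x1 and x2

# F-test of the null that both lags of x have zero coefficients
granger <- anova(restricted, unrestricted)
granger
```

A small p-value in the second row of the table is evidence against the null
hypothesis that lags of `x` have no power in explaining `y`.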


Iterated Multivariate Forecasts using an Iterated VAR


The idea of an iterated forecast for period T + 2 based on observations up to 
period T is to use the one-period-ahead forecast as an intermediate step. That 
is, the forecast for period T + 1 is used as an observation when predicting the 
level of a series for period T + 2. This can be generalized to an h-period-ahead
forecast where all intervening periods between T and T + h must be forecasted 
as they are used as observations in the process (see Chapter 16.2 of the book 
for a more detailed argument on this concept). Iterated multiperiod forecasts 
are summarized in Key Concept 16.2. 


Key Concept 16.2 
Iterated Multiperiod Forecasts 


The steps for an iterated multiperiod AR forecast are: 


1. Estimate the AR(p) model using OLS and compute the one-period-ahead
forecast.


2. Use the one-period-ahead forecast to obtain the two-period-ahead
forecast.


3. Continue by iterating to obtain forecasts farther into the future.


An iterated multiperiod VAR forecast is done as follows: 


1. Estimate the VAR(p) model using OLS per equation and compute 
the one-period-ahead forecast for all variables in the VAR. 


2. Use the one-period-ahead forecasts to obtain the two-period-ahead 
forecasts. 


3. Continue by iterating to obtain forecasts of all variables in the VAR 
farther into the future. 





Since a VAR models all variables using lags of the respective other variables, we 
need to compute forecasts for all variables. It can be cumbersome to do so when 
the VAR is large but fortunately there are R functions that facilitate this. For 
example, the function predict() can be used to obtain iterated multivariate 
forecasts for VAR models estimated by the function VAR(). 


The following code chunk shows how to compute iterated forecasts for GDP
growth and the term spread up to period 2015:Q2, that is h = 10, using the
model object VAR_est.


# compute iterated forecasts for GDP growth and term spread for the next 10 quarters
# (predict() uses n.ahead = 10 by default for objects estimated by VAR())
forecasts <- predict(VAR_est)
forecasts

#> $GDPGrowth
#>           fcst     lower    upper       CI
#>  [1,] 1.738653 -3.006124 6.483430 4.744777
#>  [2,] 1.692193 -3.312731 6.697118 5.004925
#>  [3,] 1.911852 -3.282880 7.106583 5.194731
#>  [4,] 2.137070 -3.164247 7.438386 5.301317
#>  [5,] 2.329667 -3.041435 7.700769 5.371102
#>  [6,] 2.496815 -2.931819 7.925449 5.428634
#>  [7,] 2.631849 -2.846390 8.110088 5.478239
#>  [8,] 2.734819 -2.785426 8.255064 5.520245
#>  [9,] 2.808291 -2.745597 8.362180 5.553889
#> [10,] 2.856169 -2.722905 8.435243 5.579074
#> 
#> $TSpread
#>           fcst        lower    upper        CI
#>  [1,] 1.676746  0.708471226 2.645021 0.9682751
#>  [2,] 1.884098  0.471880228 3.296316 1.4122179
#>  [3,] 1.999409  0.336348101 3.662470 1.6630609
#>  [4,] 2.080836  0.242407507 3.919265 1.8384285
#>  [5,] 2.131402  0.175797245 4.087007 1.9556052
#>  [6,] 2.156094  0.125220562 4.186968 2.0308738
#>  [7,] 2.161783  0.085037834 4.238528 2.0767452
#>  [8,] 2.154170  0.051061544 4.257278 2.1031082
#>  [9,] 2.138164  0.020749780 4.255578 2.1174139
#> [10,] 2.117733 -0.007139213 4.242605 2.1248722


This reveals that the two-quarter-ahead forecast of GDP growth in 2013:Q2 
using data through 2012:Q4 is 1.69. For the same period, the iterated VAR 
forecast for the term spread is 1.88. 
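

The iteration described in Key Concept 16.2 can be checked by hand. As a
sketch, the chunk below takes the coefficient estimates printed by VAR() and
the forecasts for h = 1 and h = 2 reported by predict() as given (the numbers
are copied from the output above, so small rounding differences are to be
expected) and iterates the two VAR equations forward. The object names
`b_gdp`, `b_ts` and `fc` are chosen for illustration only.

```r
# Sketch: reproduce the iterated VAR(2) forecasts for h = 3 and h = 4 by hand.
# coefficient order: const, GDPGrowth.l1, TSpread.l1, GDPGrowth.l2, TSpread.l2
b_gdp <- c(0.5163440, 0.2895533, -0.9025493, 0.2163919, 1.3298305)
b_ts  <- c(0.455773969, 0.009978489, 1.058227945, -0.057245123, -0.219190243)

# forecasts for h = 1 and h = 2 as reported by predict(VAR_est)
fc <- matrix(NA_real_, nrow = 4, ncol = 2,
             dimnames = list(paste0("h=", 1:4), c("GDPGrowth", "TSpread")))
fc[1, ] <- c(1.738653, 1.676746)
fc[2, ] <- c(1.692193, 1.884098)

# use earlier forecasts as if they were observations
for (h in 3:4) {
  x <- c(1, fc[h - 1, ], fc[h - 2, ])   # regressors: constant, first lags, second lags
  fc[h, "GDPGrowth"] <- sum(b_gdp * x)
  fc[h, "TSpread"]   <- sum(b_ts * x)
}

round(fc, 4)
```

Up to rounding, the rows for h = 3 and h = 4 coincide with the corresponding
point forecasts in the output of predict().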


The matrices returned by predict(VAR_est) also include 95% prediction intervals
(however, the function does not adjust for autocorrelation or heteroskedasticity
of the errors!).


We may also plot the iterated forecasts for both variables by calling plot() on
the output of predict(VAR_est).


# visualize the iterated forecasts
plot(forecasts)


Direct Multiperiod Forecasts


A direct multiperiod forecast uses a model where the predictors are lagged 
appropriately such that the available observations can be used directly to do 
the forecast. The idea of direct multiperiod forecasting is summarized in Key 
Concept 16.3.


Key Concept 16.3
Direct Multiperiod Forecasts


A direct multiperiod forecast that forecasts $h$ periods into the future using
a model of $Y_t$ and an additional predictor $X_t$ with $p$ lags is done by first
estimating


$$\begin{aligned}
Y_t =& \, \delta_0 + \delta_1 Y_{t-h} + \dots + \delta_p Y_{t-p-h+1} + \delta_{p+1} X_{t-h} \\
&+ \dots + \delta_{2p} X_{t-p-h+1} + u_t,
\end{aligned}$$


which is then used to compute the forecast of $Y_{T+h}$ based on observations
through period $T$.





For example, to obtain two-quarter-ahead forecasts of GDP growth and the
term spread we first estimate the equations


$$\begin{aligned}
GDPGR_t &= \beta_{10} + \beta_{11} GDPGR_{t-2} + \beta_{12} GDPGR_{t-3} + \gamma_{11} TSpread_{t-2} + \gamma_{12} TSpread_{t-3} + u_{1t}, \\
TSpread_t &= \beta_{20} + \beta_{21} GDPGR_{t-2} + \beta_{22} GDPGR_{t-3} + \gamma_{21} TSpread_{t-2} + \gamma_{22} TSpread_{t-3} + u_{2t}
\end{aligned}$$


and then substitute the values of $GDPGR_{2012:Q4}$, $GDPGR_{2012:Q3}$,
$TSpread_{2012:Q4}$ and $TSpread_{2012:Q3}$ into both equations. This is easily
done manually.


# estimate models for direct two-quarter-ahead forecasts
VAR_EQ1_direct <- dynlm(GDPGrowth ~ L(GDPGrowth, 2:3) + L(TSpread, 2:3),
                        start = c(1981, 1), end = c(2012, 4))

VAR_EQ2_direct <- dynlm(TSpread ~ L(GDPGrowth, 2:3) + L(TSpread, 2:3),
                        start = c(1981, 1), end = c(2012, 4))

# compute direct two-quarter-ahead forecasts
coef(VAR_EQ1_direct) %*% c(1, # intercept
                           window(GDPGrowth, start = c(2012, 3), end = c(2012, 4)),
                           window(TSpread, start = c(2012, 3), end = c(2012, 4)))
#>          [,1]
#> [1,] 2.439497

coef(VAR_EQ2_direct) %*% c(1, # intercept
                           window(GDPGrowth, start = c(2012, 3), end = c(2012, 4)),
                           window(TSpread, start = c(2012, 3), end = c(2012, 4)))
#>         [,1]
#> [1,] 1.66578


Applied economists often use the iterated method since these forecasts are more
reliable in terms of MSFE, provided that the one-period-ahead model is
correctly specified. If this is not the case, for example because one equation
in a VAR is believed to be misspecified, it can be beneficial to use direct
forecasts since the iterated method will then be biased and thus have a higher
MSFE than the direct method. See Chapter 16.2 for a more detailed discussion on
advantages and disadvantages of both methods.
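

The difference between both approaches is easiest to see in a univariate
setting. The following chunk is a sketch on a simulated AR(1) series (the
series and its coefficient are assumptions made for illustration): the iterated
forecast applies the estimated one-period-ahead model twice, whereas the direct
forecast comes from a regression of $Y_t$ on $Y_{t-2}$.

```r
# Sketch: iterated vs. direct two-period-ahead forecasts on simulated data.
set.seed(123)
n <- 500
y <- as.numeric(arima.sim(list(ar = 0.6), n = n))   # hypothetical AR(1) series

# iterated forecast: estimate the one-period-ahead model and apply it twice
ar1 <- lm(y[2:n] ~ y[1:(n - 1)])
b <- unname(coef(ar1))
f_one  <- b[1] + b[2] * y[n]      # forecast of Y_{T+1}
f_iter <- b[1] + b[2] * f_one     # forecast of Y_{T+2} based on the forecast of Y_{T+1}

# direct forecast: regress Y_t on Y_{t-2} and plug in the last observation Y_T
direct <- lm(y[3:n] ~ y[1:(n - 2)])
d <- unname(coef(direct))
f_direct <- d[1] + d[2] * y[n]

c(iterated = f_iter, direct = f_direct)
```

Since the AR(1) model is correctly specified here, both forecasts should be
close to each other.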


16.2 Orders of Integration and the DF-GLS 
Unit Root Test 


Some economic time series have smoother trends than variables that can be 
described by random walk models. A way to model these time series is 


$$\Delta Y_t = \beta_0 + \Delta Y_{t-1} + u_t,$$


where $u_t$ is a serially uncorrelated error term. This model states that the
first difference of a series is a random walk. Consequently, the series of
second differences of $Y_t$ is stationary. Key Concept 16.4 summarizes the
notation.


Key Concept 16.4 
Orders of Integration, Differencing and Stationarity 


• When a time series $Y_t$ has a unit autoregressive root, $Y_t$ is
integrated of order one. This is often denoted by $Y_t \sim I(1)$. We
simply say that $Y_t$ is $I(1)$. If $Y_t$ is $I(1)$, its first difference
$\Delta Y_t$ is stationary.


• $Y_t$ is $I(2)$ when $Y_t$ needs to be differenced twice in order to obtain
a stationary series. Using the notation introduced here, if $Y_t$ is $I(2)$,
its first difference $\Delta Y_t$ is $I(1)$ and its second difference
$\Delta^2 Y_t$ is stationary. $Y_t$ is $I(d)$ when $Y_t$ must be differenced
$d$ times to obtain a stationary series.


• When $Y_t$ is stationary, it is integrated of order 0 so $Y_t$ is $I(0)$.


It is fairly easy to obtain differences of time series in R. For example,
the function diff() returns suitably lagged and iterated differences of
numeric vectors, matrices and time series objects of the class ts.
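

As a small illustration of this notation, the following sketch constructs an
$I(2)$ series by cumulating white noise twice (the simulated series is an
assumption made for illustration) and recovers the underlying stationary
series using diff().

```r
# Sketch: differencing an I(2) series once and twice with diff().
set.seed(3)
n <- 200
e <- rnorm(n)               # stationary white noise, I(0)
y <- cumsum(cumsum(e))      # I(2) by construction

d1 <- diff(y)                    # first difference: a random walk, I(1)
d2 <- diff(y, differences = 2)   # second difference: stationary

# each round of differencing shortens the series by one observation
c(length(y), length(d1), length(d2))
```

By construction, the second difference equals the white noise series `e` from
the third observation onwards.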





Following the book, we take the price level of the U.S. measured by the Personal 
Consumption Expenditures Price Index as an example.


# define ts object of the logarithm of the U.S. PCE Price Index
PCECTPI <- ts(log(USMacroSWQ$PCECTPI),
              start = c(1957, 1),
              end = c(2012, 4),
              frequency = 4)

# plot logarithm of the PCE Price Index (the series is already in logs)
plot(PCECTPI,
     main = "Log of United States PCE Price Index",
     ylab = "Logarithm",
     col = "steelblue",
     lwd = 2)


The logarithm of the price level has a smoothly varying trend. This is typical
for an I(2) series. If the price level is indeed I(2), the first differences of
this series should be I(1). Since we are considering the logarithm of the price
level, we obtain growth rates by taking first differences. Therefore, the
differenced price level series is the series of quarterly inflation rates. This
is quickly done in R using the function diff(). As explained in Chapter 14.2,
multiplying the quarterly inflation rates by 400 yields the quarterly rate of
inflation, measured in percentage points at an annual rate.


# plot U.S. PCE price inflation
plot(400 * diff(PCECTPI),
     main = "United States PCE Price Index",
     ylab = "Percent per annum",
     col = "steelblue",
     lwd = 2)

# add a dashed line at y = 0
abline(0, 0, lty = 2)


The inflation rate behaves much more erratically than the smooth graph of the
logarithm of the PCE price index.


The DF-GLS Test for a Unit Root 


The DF-GLS test for a unit root has been developed by Elliott et al. (1996) and 
has higher power than the ADF test when the autoregressive root is large but 
less than one. That is, the DF-GLS has a higher probability of rejecting the 
false null of a stochastic trend when the sample data stems from a time series 
that is close to being integrated. 


The idea of the DF-GLS test is to test for an autoregressive unit root in the 
detrended series, whereby GLS estimates of the deterministic components are 
used to obtain the detrended version of the original series. See Chapter 16.3 of 
the book for a more detailed explanation of the approach. 


A function that performs the DF-GLS test is implemented in the package urca
(this package is a dependency of the package vars so it should be already loaded
if vars is attached). The function that computes the test statistic is ur.ers().


# DF-GLS test for unit root in GDP
summary(ur.ers(log(window(GDP, start = c(1962, 1), end = c(2012, 4))),
               model = "trend",
               lag.max = 2))

#> 
#> ############################################### 
#> # Elliot, Rothenberg and Stock Unit Root Test # 
#> ############################################### 
#> 
#> Test of type DF-GLS 
#> detrending of series with intercept and trend 
#> 
#> 
#> Call:
#> lm(formula = dfgls.form, data = data.dfgls)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.025739 -0.004054  0.000017  0.004619  0.033620 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> yd.lag       -0.01213    0.01012  -1.199  0.23207    
#> yd.diff.lag1  0.28583    0.07002   4.082 6.47e-05 ***
#> yd.diff.lag2  0.19320    0.07058   2.737  0.00676 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.007807 on 198 degrees of freedom
#> Multiple R-squared:  0.1504, Adjusted R-squared:  0.1376 
#> F-statistic: 11.69 on 3 and 198 DF,  p-value: 4.392e-07
#> 
#> 
#> Value of test-statistic is: -1.1987 
#> 
#> Critical values of DF-GLS are:
#>                  1pct  5pct 10pct
#> critical values -3.48 -2.89 -2.57


The summary of the test shows that the test statistic is about -1.2. The
10% critical value for the DF-GLS test is -2.57. This is, however, not the
appropriate critical value for the ADF test when an intercept and a time trend
are included in the Dickey-Fuller regression: the asymptotic distributions of
both test statistics differ and so do their critical values!


The test is left-sided so we cannot reject the null hypothesis that the
logarithm of U.S. GDP is nonstationary, using the DF-GLS test.


16.3 Cointegration


Key Concept 16.5 
Cointegration 


When $X_t$ and $Y_t$ are $I(1)$ and if there is a $\theta$ such that
$Y_t - \theta X_t$ is $I(0)$, $X_t$ and $Y_t$ are cointegrated. Put differently,
cointegration of $X_t$ and $Y_t$ means that $X_t$ and $Y_t$ have the same or a
common stochastic trend and that this trend can be eliminated by taking a
specific difference of the series such that the resulting series is stationary.


R functions for cointegration analysis are implemented in the package
urca.





As an example, reconsider the relation between short- and long-term interest
rates using U.S. 3-month treasury bills, U.S. 10-year treasury bonds and the
spread in their interest rates, which have been introduced in Chapter 14.4.
The next code chunk shows how to reproduce Figure 16.2 of the book.


# reproduce Figure 16.2 of the book

# plot both interest series
plot(merge(as.zoo(TB3MS), as.zoo(TB10YS)),
     plot.type = "single",
     lty = c(2, 1),
     lwd = 2,
     xlab = "Date",
     ylab = "Percent per annum",
     ylim = c(-5, 17),
     main = "Interest Rates")

# add the term spread series
lines(as.zoo(TSpread),
      col = "steelblue",
      lwd = 2)

# shade the term spread (alpha() is provided by the package scales)
polygon(c(time(TB3MS), rev(time(TB3MS))),
        c(TB10YS, rev(TB3MS)),
        col = alpha("steelblue", alpha = 0.3),
        border = NA)

# add horizontal line at 0
abline(0, 0)

# add a legend
legend("topright",
       legend = c("TB3MS", "TB10YS", "Term Spread"),
       col = c("black", "black", "steelblue"),
       lwd = c(2, 2, 2),
       lty = c(2, 1, 1))


The plot suggests that long-term and short-term interest rates are cointegrated:
both interest series seem to have the same long-run behavior. They share a
common stochastic trend. The term spread, which is obtained by taking the
difference between long-term and short-term interest rates, seems to be
stationary. In fact, the expectations theory of the term structure suggests the
cointegrating coefficient $\theta$ to be 1. This is consistent with the visual
result.


Testing for Cointegration 


Following Key Concept 16.5, it seems natural to construct a test for
cointegration of two series in the following manner: if two series $X_t$ and
$Y_t$ are cointegrated, the series obtained by taking the difference
$Y_t - \theta X_t$ must be stationary. If the series are not cointegrated,
$Y_t - \theta X_t$ is nonstationary. This is an assumption that can be tested
using a unit root test. We have to distinguish between two cases:


1. $\theta$ is known.


Knowledge of $\theta$ enables us to compute differences
$z_t = Y_t - \theta X_t$ so that Dickey-Fuller and DF-GLS unit root tests can
be applied to $z_t$. For these tests, the critical values are the critical
values of the ADF or DF-GLS test.


2. $\theta$ is unknown.


If $\theta$ is unknown, it must be estimated before the unit root test can be
applied. This is done by estimating the regression


$$Y_t = \alpha + \theta X_t + z_t$$


using OLS (this is referred to as the first-stage regression). Then, a
Dickey-Fuller test is used for testing the hypothesis that $z_t$ is a
nonstationary series. This is known as the Engle-Granger Augmented
Dickey-Fuller test for cointegration (or EG-ADF test) after Engle and Granger
(1987). The critical values for this test are special as the associated null
distribution is nonnormal and depends on the number of $I(1)$ variables used as
regressors in the first stage regression. You may look them up in Table 16.2 of
the book. When there are only two presumably cointegrated variables (and thus a
single $I(1)$ variable is used in the first stage OLS regression) the critical
values for the levels 10%, 5% and 1% are -3.12, -3.41 and -3.96.
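

The two steps of the test for the case of an unknown cointegrating coefficient
can be sketched with base R. The chunk below uses simulated data (the series
`x` and `y` are assumptions: a random walk and a series that is cointegrated
with it by construction) and, for brevity, a Dickey-Fuller regression without
additional lagged differences, so it is a simplified version of the EG-ADF
test rather than the exact procedure used in the application.

```r
# Sketch: two-step Engle-Granger procedure on simulated cointegrated series.
set.seed(7)
n <- 250
x <- cumsum(rnorm(n))       # I(1): a random walk
y <- 1 + x + rnorm(n)       # cointegrated with x (true coefficient is 1)

# step 1: first-stage regression, estimate the cointegrating coefficient
first_stage <- lm(y ~ x)
theta_hat <- unname(coef(first_stage)["x"])
z_hat <- resid(first_stage)

# step 2: Dickey-Fuller regression (no intercept) on the residuals
dz <- diff(z_hat)
z_lag <- z_hat[-n]
df_reg <- lm(dz ~ 0 + z_lag)
eg_stat <- coef(summary(df_reg))["z_lag", "t value"]

# compare with the critical values for a single I(1) regressor stated above
crit <- c("10%" = -3.12, "5%" = -3.41, "1%" = -3.96)
eg_stat
eg_stat < crit
```

The test is left-sided: the null hypothesis of no cointegration is rejected if
the statistic is smaller than the critical value.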


Application to Interest Rates 


As has been mentioned above, the theory of the term structure suggests that
long-term and short-term interest rates are cointegrated with a cointegration
coefficient of $\theta = 1$. In the previous section we have seen that there is
visual evidence for this conjecture since the spread of 10-year and 3-month
interest rates looks stationary.


We continue by using formal tests (the ADF and the DF-GLS test) to see
whether the individual interest rate series are integrated and if their
difference is stationary (for now, we assume that $\theta = 1$ is known). Both
are conveniently done by using the functions ur.df() for computation of the ADF
test and ur.ers() for conducting the DF-GLS test. Following the book we use
data from 1962:Q1 to 2012:Q4 and employ models that include a drift term. We
set the maximum lag order to 6 and use the AIC for selection of the optimal lag
length.


# test for nonstationarity of 3-month treasury bills using ADF test
ur.df(window(TB3MS, c(1962, 1), c(2012, 4)),
      lags = 6,
      selectlags = "AIC",
      type = "drift")
#> 
#> ############################################################### 
#> # Augmented Dickey-Fuller Test Unit Root / Cointegration Test # 
#> ############################################################### 
#> 
#> The value of the test statistic is: -2.1004 2.2385


# test for nonstationarity of 10-years treasury bonds using ADF test
ur.df(window(TB10YS, c(1962, 1), c(2012, 4)),
      lags = 6,
      selectlags = "AIC",
      type = "drift")
#> 
#> ############################################################### 
#> # Augmented Dickey-Fuller Test Unit Root / Cointegration Test # 
#> ############################################################### 
#> 
#> The value of the test statistic is: -1.0079 0.5501


# test for nonstationarity of 3-month treasury bills using DF-GLS test
ur.ers(window(TB3MS, c(1962, 1), c(2012, 4)),
       model = "constant",
       lag.max = 6)
#> 
#> ############################################################### 
#> # Elliot, Rothenberg and Stock Unit Root / Cointegration Test # 
#> ############################################################### 
#> 
#> The value of the test statistic is: -1.8042


# test for nonstationarity of 10-years treasury bonds using DF-GLS test
ur.ers(window(TB10YS, c(1962, 1), c(2012, 4)),
       model = "constant",
       lag.max = 6)
#> 
#> ############################################################### 
#> # Elliot, Rothenberg and Stock Unit Root / Cointegration Test # 
#> ############################################################### 
#> 
#> The value of the test statistic is: -0.942


The corresponding 10% critical value for both tests is -2.57 so we cannot reject
the null hypotheses of nonstationarity for either series, even at the 10% level
of significance.¹ We conclude that it is plausible to model both interest rate
series as I(1).


Next, we apply the ADF and the DF-GLS test to test for nonstationarity of 
the term spread series, which means we test for non-cointegration of long- and 
short-term interest rates. 





¹ Note: ur.df() reports two test statistics when there is a drift in the ADF
regression. The first of which (the one we are interested in here) is the
t-statistic for the test that the coefficient on the first lag of the series is
0. The second one is an F-type statistic for the joint hypothesis that this
coefficient and the drift term both equal 0.


# test if term spread is stationary (cointegration of interest rates) using ADF
ur.df(window(TB10YS, c(1962, 1), c(2012, 4)) - window(TB3MS, c(1962, 1), c(2012, 4)),
      lags = 6,
      selectlags = "AIC",
      type = "drift")
#> 
#> ############################################################### 
#> # Augmented Dickey-Fuller Test Unit Root / Cointegration Test # 
#> ############################################################### 
#> 
#> The value of the test statistic is: -3.9308 7.7362


# test if term spread is stationary (cointegration of interest rates) using DF-GLS test
ur.ers(window(TB10YS, c(1962, 1), c(2012, 4)) - window(TB3MS, c(1962, 1), c(2012, 4)),
       model = "constant",
       lag.max = 6)
#> 
#> ############################################################### 
#> # Elliot, Rothenberg and Stock Unit Root / Cointegration Test # 
#> ############################################################### 
#> 
#> The value of the test statistic is: -3.8576


Table 16.1 summarizes the results. 


Table 16.1: ADF and DF-GLS Test Statistics for Interest Rate Series


Series            ADF Test Statistic    DF-GLS Test Statistic
TB3MS                   -2.10                  -1.80
TB10YS                  -1.01                  -0.94
TB10YS - TB3MS          -3.93                  -3.86


Both tests reject the hypothesis of nonstationarity of the term spread series at 
the 1% level of significance, which is strong evidence in favor of the hypothesis 
that the term spread is stationary, implying cointegration of long- and short- 
term interest rates. 


Since theory suggests that $\theta = 1$, there is no need to estimate $\theta$
so it is not necessary to use the EG-ADF test which allows $\theta$ to be
unknown. However, since it is instructive to do so, we follow the book and
compute this test statistic. The first-stage OLS regression is


$$TB10YS_t = \beta_0 + \beta_1 TB3MS_t + z_t.$$


# estimate first-stage regression of EG-ADF test

FS_EGADF <- dynlm(window(TB10YS, c(1962, 1), c(2012, 4)) ~ window(TB3MS, c(1962, 1), c(2012, 4))) 
FS_EGADF 

#> 

#> Time series regression with "ts" data: 

#> Start = 1962(1), End = 2012(4) 

#> 

#> Call: 

#> dynlm(formula = window(TB10YS, c(1962, 1), c(2012, 4)) ~ window(TB3MS, 

#> c(1962, 1), c(2012, 4))) 

#> 

#> Coefficients: 

#> (Intercept) window(TB3MS, c(1962, 1), c(2012, 4)) 
#> 2.4642 0.8147 


Thus we have


$$\widehat{TB10YS}_t = 2.46 + 0.81 \cdot TB3MS_t,$$


where $\widehat{\theta} = 0.81$. Next, we take the residual series
$\{\widehat{z}_t\}$ and compute the ADF test statistic.


# compute the residuals 
z_hat <- resid(FS_EGADF) 


# compute the ADF test statistic 

ur.df(z_hat, lags = 6, type = "none", selectlags = "AIC") 

#> 
#> ############################################################### 
#> # Augmented Dickey-Fuller Test Unit Root / Cointegration Test # 
#> ############################################################### 
#> 
#> The value of the test statistic is: -3.1935


The test statistic is -3.19 which is smaller than the 10% critical value but
larger than the 5% critical value (see Table 16.2 of the book). Thus, the null
hypothesis of no cointegration can be rejected at the 10% level but not at the
5% level. This indicates lower power of the EG-ADF test due to the estimation
of $\theta$: when $\theta = 1$ is the correct value, we expect the power of the
ADF test for a unit root in the residual series $z_t = TB10YS_t - TB3MS_t$ to
be higher than when some estimate $\widehat{\theta}$ is used.


A Vector Error Correction Model for $TB10YS_t$ and $TB3MS_t$


If two $I(1)$ time series $X_t$ and $Y_t$ are cointegrated, their differences
are stationary and can be modeled in a VAR which is augmented by the regressor
$Y_{t-1} - \theta X_{t-1}$. This is called a vector error correction model
(VECM) and $Y_t - \theta X_t$ is called the error correction term. Lagged
values of the error correction term are useful for predicting $\Delta X_t$
and/or $\Delta Y_t$.


A VECM can be used to model the two interest rates considered in the previous
sections. Following the book we specify the VECM to include two lags of both
series as regressors and choose $\theta = 1$, as theory suggests (see above).


TB10YS <- window(TB10YS, c(1962, 1), c(2012, 4))
TB3MS <- window(TB3MS, c(1962, 1), c(2012, 4))

# set up error correction term
VECM_ECT <- TB10YS - TB3MS

# estimate both equations of the VECM using 'dynlm()'
VECM_EQ1 <- dynlm(d(TB10YS) ~ L(d(TB3MS), 1:2) + L(d(TB10YS), 1:2) + L(VECM_ECT))
VECM_EQ2 <- dynlm(d(TB3MS) ~ L(d(TB3MS), 1:2) + L(d(TB10YS), 1:2) + L(VECM_ECT))

# rename regressors for better readability
names(VECM_EQ1$coefficients) <- c("Intercept", "D_TB3MS_l1", "D_TB3MS_l2",
                                  "D_TB10YS_l1", "D_TB10YS_l2", "ect_l1")
names(VECM_EQ2$coefficients) <- names(VECM_EQ1$coefficients)

# coefficient summaries using HAC standard errors
coeftest(VECM_EQ1, vcov. = NeweyWest(VECM_EQ1, prewhite = FALSE, adjust = TRUE))
#> 
#> t test of coefficients:
#> 
#>               Estimate Std. Error t value Pr(>|t|)   
#> Intercept    0.1227089  0.0551419  2.2253 0.027205 * 
#> D_TB3MS_l1  -0.0016601  0.0727060 -0.0228 0.981807   
#> D_TB3MS_l2  -0.0680845  0.0435059 -1.5649 0.119216   
#> D_TB10YS_l1  0.2264878  0.0957071  2.3665 0.018939 * 
#> D_TB10YS_l2 -0.0734486  0.0703476 -1.0441 0.297740   
#> ect_l1      -0.0878871  0.0285644 -3.0768 0.002393 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(VECM_EQ2, vcov. = NeweyWest(VECM_EQ2, prewhite = FALSE, adjust = TRUE))
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value Pr(>|t|)  
#> Intercept   -0.060746   0.107937 -0.5628  0.57422  
#> D_TB3MS_l1   0.240003   0.111611  2.1504  0.03276 *
#> D_TB3MS_l2  -0.155883   0.153845 -1.0132  0.31220  
#> D_TB10YS_l1  0.113740   0.125571  0.9058  0.36617  
#> D_TB10YS_l2 -0.147519   0.112630 -1.3098  0.19182  
#> ect_l1       0.031506   0.050519  0.6236  0.53359  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


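Thus the two estimated equations of the VECM are


$$\begin{aligned}
\widehat{\Delta TB3MS}_t =& \, -\underset{(0.11)}{0.06} + \underset{(0.11)}{0.24} \Delta TB3MS_{t-1} - \underset{(0.15)}{0.16} \Delta TB3MS_{t-2} \\
&+ \underset{(0.13)}{0.11} \Delta TB10YS_{t-1} - \underset{(0.11)}{0.15} \Delta TB10YS_{t-2} + \underset{(0.05)}{0.03} ECT_{t-1}, \\
\widehat{\Delta TB10YS}_t =& \, \underset{(0.06)}{0.12} - \underset{(0.07)}{0.00} \Delta TB3MS_{t-1} - \underset{(0.04)}{0.07} \Delta TB3MS_{t-2} \\
&+ \underset{(0.10)}{0.23} \Delta TB10YS_{t-1} - \underset{(0.07)}{0.07} \Delta TB10YS_{t-2} - \underset{(0.03)}{0.09} ECT_{t-1}.
\end{aligned}$$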


The output produced by coeftest() shows that there is little evidence that
lagged values of the differenced interest rate series are useful for prediction.
This finding is more pronounced for the equation of the differenced series of
the 3-month treasury bill rate, where the error correction term (the lagged term
spread) is not significantly different from zero at any common level of
significance. However, for the differenced 10-year treasury bond rate the error
correction term is statistically significant at the 1% level with an estimate of
$-0.09$. This can be interpreted as follows: although both interest rates are
nonstationary, their cointegrating relationship allows us to predict the change
in the 10-year treasury bond rate using the VECM. In particular, the negative
estimate of the coefficient on the error correction term indicates that the
10-year treasury bond rate is predicted to fall in the next period when it is
unusually high relative to the 3-month treasury bill rate
in the current period.
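
To make the size of this effect concrete, consider a purely hypothetical period in which the term spread is 2 percentage points. Taking the error correction term to be the lagged term spread, as stated above, and holding all other terms of the equation fixed, the error correction term contributes

$$-0.09 \times ECT_{t-1} = -0.09 \times 2 = -0.18$$

percentage points to the predicted change in the 10-year rate. Equivalently, each additional percentage point of spread lowers the predicted change by about 0.09 percentage points. The value 2 is made up for illustration and is not taken from the data.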


16.4 Volatility Clustering and Autoregressive 
Conditional Heteroskedasticity 


Financial time series often exhibit a behavior that is known as volatility
clustering: the volatility changes over time and its degree shows a tendency to
persist, i.e., there are periods of low volatility and periods where volatility
is high. Econometricians call this autoregressive conditional
heteroskedasticity. Conditional heteroskedasticity is an interesting property
because it can be exploited for forecasting the variance of future periods. 


As an example, we consider daily changes in the Wilshire 5000 stock index. 
The data is available for download at the Federal Reserve Economic Data Base. 
For consistency with the book we download data from 1989-12-29 to 2013-12-31 
(choosing this somewhat larger time span is necessary since later on we will be 
working with daily changes of the series). 


The following code chunk shows how to format the data and how to reproduce 
Figure 16.3 of the book. 


# import data on the Wilshire 5000 index 
W5000 <- read.csv2("data/Wilshire5000.csv", 
                   stringsAsFactors = F, 
                   header = T, 
                   sep = ",", 
                   na.strings = ".") 


# transform the columns 
W5000$DATE <- as.Date(W5000$DATE) 
W5000$WILL5000INDFC <- as.numeric(W5000$WILL5000INDFC) 


# remove NAs 
W5000 <- na.omit(W5000) 


# compute daily percentage changes (Delt() is provided by the quantmod package) 
W5000_PC <- data.frame("Date" = W5000$DATE, 
                       "Value" = as.numeric(Delt(W5000$WILL5000INDFC) * 100)) 
W5000_PC <- na.omit(W5000_PC) 


# plot percentage changes 
plot(W5000_PC, 
     ylab = "Percent", 
     main = "Daily Percentage Changes", 
     type = "l", 
     col = "steelblue", 
     lwd = 0.5) 


# add horizontal line at y = 0 
abline(0, 0) 
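
As a cross-check on the transformation used above, the following minimal sketch computes one-period percentage changes in base R. It assumes that Delt(), called with a single series and its default settings, returns the one-period arithmetic change with NA in the first position. The helper pct_change() and the toy price vector are hypothetical and introduced only for this illustration.

```r
# hypothetical helper: one-period percentage change of a numeric vector
pct_change <- function(x) {
  # (x_t - x_{t-1}) / x_{t-1} * 100, with NA for the first observation
  c(NA, diff(x) / head(x, -1) * 100)
}

# toy index values (made up for illustration, not Wilshire data)
prices <- c(100, 102, 99.96, 104.958)

# yields NA followed by changes of roughly 2, -2 and 5 percent
pct_change(prices)
```

The leading NA is the reason na.omit() is applied to W5000_PC after the transformation.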


The series of daily percentage changes in the Wilshire index seems to randomly 
fluctuate around zero, meaning there is little autocorrelation. This is confirmed 
by a plot of the sample autocorrelation function. 


# plot sample autocorrelation of daily percentage changes 
acf(W5000_PC$Value, main = "Wilshire 5000 Series") 


In Figure 16.2 we see that autocorrelations are rather weak so that it is difficult 
to predict future outcomes using, e.g., an AR model. 


Figure 16.1: Daily Percentage Returns in the Wilshire 5000 Index 


However, there is visual evidence in Figure 16.1 that the series of returns 
exhibits conditional heteroskedasticity since we observe volatility clustering. 
For some applications it is useful to measure and forecast these patterns. This 
can be done using models which assume that the volatility can be described by 
an autoregressive process. 


ARCH and GARCH Models 


Consider 

$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \gamma_1 X_{t-1} + u_t,$$

an ADL(1,1) regression model. The econometrician Robert Engle (1982) proposed 
to model $\sigma^2_t = Var(u_t | u_{t-1}, u_{t-2}, \ldots)$, the conditional 
variance of the error $u_t$ given its past, by an order $p$ distributed lag model, 

$$\sigma^2_t = \alpha_0 + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \dots + \alpha_p u_{t-p}^2, \tag{16.1}$$


called the autoregressive conditional heteroskedasticity (ARCH) model of order 
$p$, or ARCH($p$) for short.² We assume $\alpha_0 > 0$ and 
$\alpha_1, \ldots, \alpha_p \geq 0$ to ensure a positive variance 
$\sigma^2_t > 0$. The general idea is apparent from the model structure: 
positive coefficients $\alpha_0, \alpha_1, \ldots, \alpha_p$ imply that recent 
large squared errors lead to 
a large variance, and thus large squared errors, in the current period. 
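
To see how such a recursion generates volatility clustering, the sketch below simulates an ARCH(1) error process in base R. The parameter values (0.2 for the intercept and 0.5 for the ARCH coefficient), the sample size and the seed are arbitrary assumptions chosen for illustration. They are not estimates from the Wilshire data.

```r
# simulate an ARCH(1) process:
#   u_t = sigma_t * e_t,  sigma_t^2 = alpha0 + alpha1 * u_{t-1}^2
set.seed(1)        # arbitrary seed for reproducibility
n      <- 5000     # number of simulated periods (assumption)
alpha0 <- 0.2      # illustrative intercept, must be > 0
alpha1 <- 0.5      # illustrative ARCH coefficient, must be >= 0

e      <- rnorm(n) # i.i.d. standard normal shocks
u      <- numeric(n)
sigma2 <- numeric(n)

# start the recursion at the unconditional variance alpha0 / (1 - alpha1)
sigma2[1] <- alpha0 / (1 - alpha1)
u[1]      <- sqrt(sigma2[1]) * e[1]

for (t in 2:n) {
  sigma2[t] <- alpha0 + alpha1 * u[t - 1]^2
  u[t]      <- sqrt(sigma2[t]) * e[t]
}

# lag-1 sample autocorrelation of the errors and of the squared errors
acf_levels  <- acf(u,   lag.max = 5, plot = FALSE)$acf[2]
acf_squares <- acf(u^2, lag.max = 5, plot = FALSE)$acf[2]
c(levels = acf_levels, squares = acf_squares)
```

In theory the simulated errors are serially uncorrelated while their squares are positively autocorrelated, which is the pattern that a plot of u against time shows as alternating calm and turbulent stretches.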


The generalized ARCH (GARCH) model, developed by Tim Bollerslev (1986), 
is an extension of the ARCH model, where $\sigma^2_t$ is allowed to depend on 
its own lags and lags of the squared error term. The GARCH($p$,$q$) model is 
given by 

$$\sigma^2_t = \alpha_0 + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \dots + \alpha_p u_{t-p}^2 + \phi_1 \sigma_{t-1}^2 + \dots + \phi_q \sigma_{t-q}^2. \tag{16.2}$$





² Although we introduce the ARCH model as a component in an ADL(1,1) model, it can 
be used for modelling the conditional zero-mean error term of any time series model. 


Figure 16.2: Autocorrelation in Daily Price Changes of W5000 Index 


The GARCH model is an ADL($p$,$q$) model and thus can provide more 
parsimonious parameterizations than the ARCH model (see the discussion in 
Appendix 15.2 of the book). 
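
The parsimony argument can be illustrated for the GARCH(1,1) case. Writing $\sigma^2_t$ for the conditional variance, $\alpha_1$ for the coefficient on the lagged squared error and $\phi_1$ for the coefficient on the lagged variance, and assuming $0 \leq \phi_1 < 1$, repeated substitution of the variance equation into itself gives

$$\begin{aligned}
\sigma^2_t &= \alpha_0 + \alpha_1 u_{t-1}^2 + \phi_1 \sigma^2_{t-1} \\
&= \alpha_0 + \alpha_1 u_{t-1}^2 + \phi_1 \left( \alpha_0 + \alpha_1 u_{t-2}^2 + \phi_1 \sigma^2_{t-2} \right) \\
&= \dots = \frac{\alpha_0}{1 - \phi_1} + \alpha_1 \sum_{j=0}^{\infty} \phi_1^{\,j} \, u_{t-1-j}^2 .
\end{aligned}$$

A GARCH(1,1) model with three parameters therefore corresponds to an ARCH model of infinite order whose coefficients $\alpha_1 \phi_1^{\,j}$ decline geometrically.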


Application to Stock Price Volatility 


Maximum likelihood estimates of ARCH and GARCH models are efficient and 
have normal distributions in large samples, such that the usual methods for 
conducting inference about the unknown parameters can be applied. The R 
package fGarch is a collection of functions for analyzing and modelling the 
heteroskedastic behavior in time series models. The following application 
reproduces the results presented in Chapter 16.5 of the book by means of the 
function garchFit(). This function is somewhat sophisticated. It allows for 
different specifications of the optimization procedure, different error 
distributions and much more (use ?garchFit for a detailed description of the 
arguments). In particular, the standard errors reported by garchFit() are 
robust. 


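The GARCH(1,1) model of daily changes in the Wilshire 5000 index we estimate 
is given by 

$$\begin{aligned}
R_t &= \beta_0 + u_t, \quad u_t \sim \mathcal{N}(0, \sigma^2_t), \\
\sigma^2_t &= \alpha_0 + \alpha_1 u_{t-1}^2 + \phi_1 \sigma_{t-1}^2, 
\end{aligned} \tag{16.4}$$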





where $R_t$ is the percentage change in period $t$. $\beta_0$, $\alpha_0$, 
$\alpha_1$ and $\phi_1$ are unknown coefficients and $u_t$ is an error term 
with conditional mean zero. We do not include any lagged predictors in the 
equation of $R_t$ because the daily changes in the Wilshire 5000 index reflect 
daily stock returns which are essentially unpredictable. Note that $u_t$ is 
assumed to be normally distributed and the variance $\sigma^2_t$ depends on 
$t$ as it follows the GARCH(1,1) recursion (16.4). 


It is straightforward to estimate this model using garchFit(). 


# estimate GARCH(1,1) model of daily percentage changes 
GARCH_Wilshire <- garchFit(data = W5000_PC$Value, trace = F) 


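We obtain 

$$\widehat{R}_t = \underset{(0.010)}{0.068}, \tag{16.5}$$

$$\widehat{\sigma}^2_t = \underset{(0.002)}{0.011} + \underset{(0.007)}{0.081} u_{t-1}^2 + \underset{(0.008)}{0.909} \sigma_{t-1}^2, \tag{16.6}$$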


so the coefficients on $u_{t-1}^2$ and $\sigma_{t-1}^2$ are statistically 
significant at any common level of significance. One can show that the 
persistence of movements in $\sigma^2_t$ is determined by the sum of both 
coefficients, which is 0.99 here. This indicates that movements in the 
conditional variance are highly persistent, implying long-lasting periods of 
high volatility, which is consistent with the visual evidence for 
volatility clustering presented above. 
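
A brief worked calculation shows what a sum of 0.99 implies. It uses the rounded estimates 0.011, 0.081 and 0.909, writes $\alpha_0$, $\alpha_1$ and $\phi_1$ for the three variance coefficients, and assumes $\alpha_1 + \phi_1 < 1$. Taking expectations of the variance equation gives the unconditional variance $\bar{\sigma}^2$, and iterating the equation forward shows that forecasts of the conditional variance revert to it geometrically:

$$\bar{\sigma}^2 = \frac{\alpha_0}{1 - \alpha_1 - \phi_1} \approx \frac{0.011}{1 - 0.081 - 0.909} = 1.1,$$

$$E\left(\sigma^2_{t+h} \mid u_t, u_{t-1}, \ldots\right) - \bar{\sigma}^2 = (\alpha_1 + \phi_1)^{h-1} \left(\sigma^2_{t+1} - \bar{\sigma}^2\right), \quad h \geq 1.$$

With $\alpha_1 + \phi_1 \approx 0.99$, half of a deviation of the conditional variance from its long-run level is expected to remain after about $\ln(0.5) / \ln(0.99) \approx 69$ trading days. Because the denominator $1 - 0.99$ is small, both numbers are sensitive to the rounding of the estimates and should be read as rough orders of magnitude.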


The estimated conditional variance $\widehat{\sigma}^2_t$ can be computed by 
plugging the residuals from (16.5) into equation (16.6). This is performed 
automatically by garchFit(), so to obtain the estimated conditional standard 
deviations $\widehat{\sigma}_t$ we only have to read out the values from 
GARCH_Wilshire by appending @sigma.t. 
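
The plug-in step described above can be sketched in base R. The function garch11_variance() and its argument names are hypothetical and serve only to illustrate the recursion. How garchFit() initializes the recursion internally is not covered here, so the start value is an assumption and the result need not match the values in @sigma.t exactly in the first periods.

```r
# hypothetical helper: conditional variances from a GARCH(1,1) recursion
garch11_variance <- function(u, alpha0, alpha1, phi1,
                             sigma2_init = mean(u^2)) {
  n      <- length(u)
  sigma2 <- numeric(n)
  sigma2[1] <- sigma2_init   # assumed start value of the recursion
  for (t in seq_len(n)[-1]) {
    sigma2[t] <- alpha0 + alpha1 * u[t - 1]^2 + phi1 * sigma2[t - 1]
  }
  sigma2
}

# toy residuals (made up), combined with the rounded estimates from (16.6)
u_toy  <- c(0.5, -2.0, 1.0, 0.1)
s2_toy <- garch11_variance(u_toy, alpha0 = 0.011, alpha1 = 0.081,
                           phi1 = 0.909, sigma2_init = 1)

# conditional standard deviations implied by the recursion
sqrt(s2_toy)
```

In the toy series the large residual of -2 in the second period raises the conditional variance in the third period, which is the mechanism behind the widening bands in turbulent periods.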





Using the $\widehat{\sigma}_t$ we plot bands of $\pm$ one conditional standard 
deviation along with deviations of the series of daily percentage changes in 
the Wilshire 5000 index from its mean. The following code chunk reproduces 
Figure 16.4 of the book. 


# compute deviations of the percentage changes from their mean 
dev_mean_W5000_PC <- W5000_PC$Value - GARCH_Wilshire@fit$coef[1] 


# plot deviation of percentage changes from mean 
plot(W5000_PC$Date, dev_mean_W5000_PC, 
     type = "l", 
     col = "steelblue", 
     ylab = "Percent", 
     xlab = "Date", 
     main = "Estimated Bands of +- One Conditional Standard Deviation", 
     lwd = 0.2) 


# add horizontal line at y = 0 
abline(0, 0) 


# add GARCH(1,1) bands (+- one conditional standard deviation) to the plot 
lines(W5000_PC$Date, 
      GARCH_Wilshire@fit$coef[1] + GARCH_Wilshire@sigma.t, 
      col = "darkred", 
      lwd = 0.5) 


lines(W5000_PC$Date, 
      GARCH_Wilshire@fit$coef[1] - GARCH_Wilshire@sigma.t, 
      col = "darkred", 
      lwd = 0.5) 


The bands of the estimated conditional standard deviations track the observed 
heteroskedasticity in the series of daily changes of the Wilshire 5000 index quite 
well. This is useful for quantifying the time-varying volatility and the resulting 
risk for investors holding stocks summarized by the index. Furthermore, this 
GARCH model may also be used to produce forecast intervals whose widths 
depend on the volatility of the most recent periods. 
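
As a rough sketch of such forecast intervals, the code below iterates the GARCH(1,1) recursion forward using the rounded estimates from (16.5) and (16.6). The function garch11_forecast() is a hypothetical helper, and the values chosen for the last residual and the last conditional variance are made up. The intervals treat the errors as conditionally normal and ignore estimation uncertainty in the coefficients.

```r
# hypothetical helper: h-step-ahead interval forecasts from a GARCH(1,1) model
garch11_forecast <- function(h, u_last, sigma2_last,
                             beta0, alpha0, alpha1, phi1, level = 0.95) {
  sigma2 <- numeric(h)
  # one step ahead: the last residual and the last variance are known
  sigma2[1] <- alpha0 + alpha1 * u_last^2 + phi1 * sigma2_last
  # further ahead: the unknown squared error is replaced by its forecast
  for (j in seq_len(h)[-1]) {
    sigma2[j] <- alpha0 + (alpha1 + phi1) * sigma2[j - 1]
  }
  z <- qnorm(1 - (1 - level) / 2)
  data.frame(horizon = seq_len(h),
             sd      = sqrt(sigma2),
             lower   = beta0 - z * sqrt(sigma2),
             upper   = beta0 + z * sqrt(sigma2))
}

# made-up end-of-sample states: a calm and a turbulent recent period
calm <- garch11_forecast(5, u_last = 0.2, sigma2_last = 0.5,
                         beta0 = 0.068, alpha0 = 0.011,
                         alpha1 = 0.081, phi1 = 0.909)
turbulent <- garch11_forecast(5, u_last = 4, sigma2_last = 6,
                              beta0 = 0.068, alpha0 = 0.011,
                              alpha1 = 0.081, phi1 = 0.909)

calm
turbulent
```

The intervals for the daily percentage change are much wider after the turbulent period than after the calm one, and in both cases the forecast standard deviations drift slowly toward their long-run level as the horizon grows.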


Summary 


• We have discussed how vector autoregressions are conveniently estimated 
and used for forecasting in R by means of functions from the vars package. 


• The package urca supplies advanced methods for unit root and 
cointegration analysis like the DF-GLS and the EG-ADF tests. In an application 
we have found evidence that 3-month and 10-year interest rates have a 
common stochastic trend (that is, they are cointegrated) and thus can be 
modeled using a vector error correction model. 


• Furthermore, we have introduced the concept of volatility clustering and 
demonstrated how the function garchFit() from the package fGarch can 
be employed to estimate a GARCH(1,1) model of the conditional 
heteroskedasticity inherent in returns of the Wilshire 5000 stock index. 